[00:08:36] (PS47) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[00:09:32] (PS48) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[00:11:47] SRE, Traffic-Icebox, WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Seksen) So based purely on Krinkle's comment above it does appear that this would likely be a problem with the parser cache. Adding ?123 query string into th...
[00:12:04] (CR) Raymond Ndibe: "ignore the last three patches. I was trying to fix a problem I introduced when I did git pull" [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[00:13:28] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (colewhite)
[00:17:59] (CR) Raymond Ndibe: "thanks David for working on this. this will make testing easier than it currently is" [puppet] - https://gerrit.wikimedia.org/r/867566 (owner: David Caro)
[00:25:46] SRE, Traffic-Icebox, WMF-General-or-Unknown, Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Krinkle) @Seksen When browsing with a login session, you do still enjoy the performance benefit of the ParserCache, this is appl...
[00:25:49] SRE, Traffic-Icebox, WMF-General-or-Unknown, Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Krinkle)
[00:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:45:46] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS bullseye
[00:46:59] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2037.codfw.wmnet with OS bullseye
[01:00:39] (PS1) Cwhite: logstash: heavily restrict mediawiki http accesslog during initial onboarding [puppet] - https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439)
[01:01:02] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage
[01:02:24] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[01:04:08] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage
[01:05:12] (PS1) Eevans: Migrate echostore & sessionstore staging to new cassandra-dev cluster [deployment-charts] - https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113)
[01:06:04] (PS1) Cwhite: site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335)
[01:06:40] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[01:17:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:18:18] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2036.codfw.wmnet with OS bullseye
[01:22:01] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2037.codfw.wmnet with OS bullseye
[01:22:12] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:27:54] (CR) RLazarus: slo_dashboards: dynamic slo dashboard panels (3 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[01:34:00] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:35:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:40:15] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 203 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:51:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:54:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 188 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:55:15] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:24] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:10:15] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:15] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:48] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:43:26] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:00:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:29:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:42:21] SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Ladsgroup)
[05:54:45] (PS1) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T312666)
[05:55:13] (PS2) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662)
[05:55:50] (PS3) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662)
[06:47:12] (PS1) Marostegui: misc.my.cnf, production.my.cnf: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867909
[06:47:54] (PS1) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689)
[06:48:16] (PS1) Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689)
[06:48:26] (CR) Marostegui: [C: +2] misc.my.cnf, production.my.cnf: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867909 (owner: Marostegui)
[06:50:17] (PS1) Marostegui: mariadb: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867912
[06:50:58] (CR) Marostegui: [C: +2] mariadb: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867912 (owner: Marostegui)
[07:12:27] (CR) Hashar: [C: +2] wm-checks-api: show processor prototype name on error [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867648 (owner: Hashar)
[07:12:37] (CR) Hashar: [C: +2] wm-checks-api: parse PipelineLib messages [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867656 (owner: Hashar)
[07:13:00] (Merged) jenkins-bot: wm-checks-api: show processor prototype name on error [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867648 (owner: Hashar)
[07:13:06] (Merged) jenkins-bot: wm-checks-api: parse PipelineLib messages [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867656 (owner: Hashar)
[07:28:36] !log phedenskog@deploy1002 Started deploy [performance/navtiming@7ba179f]: (no justification provided)
[07:28:44] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@7ba179f]: (no justification provided) (duration: 00m 08s)
[08:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0800)
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:07:42] (PS1) Hashar: scap: disable git safe.directory [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128)
[08:15:22] (PS1) Giuseppe Lavagetto: mediawiki: adapt releases to the changes upstream in puppet [deployment-charts] - https://gerrit.wikimedia.org/r/868005
[08:17:09] (CR) Giuseppe Lavagetto: [C: +1] "This will need for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868005 to be merged afterwards" [puppet] - https://gerrit.wikimedia.org/r/860102 (owner: Effie Mouzeli)
[08:17:45] (CR) Hashar: "I tried to cherry pick it on the beta cluster Puppet master and it has the same issue since the repository is owned by "gitpuppet". I have" [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: Hashar)
[08:17:47] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: JHathaway)
[08:18:20] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: Isabelle Hurbain-Palatin)
[08:24:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:24:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:53] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068
[08:25:57] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[08:26:04] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068 (duration: 00m 11s)
[08:29:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:30:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:33:53] I am going to restart Gerrit for a plugin upgrade
[08:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:37:43] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@353573b]: HDFS usage dataset pipeline deployment without superuser TEST [airflow-dags@353573b]
[08:37:54] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@353573b]: HDFS usage dataset pipeline deployment without superuser TEST [airflow-dags@353573b] (duration: 00m 10s)
[08:39:16] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@353573b]: HDFS usage dataset pipeline deployment without superuser [airflow-dags@353573b]
[08:39:29] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@353573b]: HDFS usage dataset pipeline deployment without superuser [airflow-dags@353573b] (duration: 00m 13s)
[08:39:47] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068
[08:39:50] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[08:39:56] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068 (duration: 00m 09s)
[08:41:59] !log Restarted Gerrit for a plugin update
[08:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:08] this time it stopped almost instantly
[08:52:22] SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) a:elukey
[09:00:05] hashar and ^demon: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0900).
[09:01:58] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[09:02:55] (PS1) JMeybohm: calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191)
[09:02:57] (PS1) JMeybohm: calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191)
[09:03:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:04:48] (CR) Elukey: [C: +1] calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:05:06] (CR) Elukey: [C: +1] calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:06:01] oh the tran
[09:06:03] train
[09:07:25] I need to check the overnight logs first
[09:07:39] I had some side work to do this morning
[09:11:08] (CR) JMeybohm: [C: +2] calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:11:11] (CR) JMeybohm: [C: +2] calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:13:50] (PS1) Ladsgroup: search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621)
[09:13:57] jouncebot: nowandnext
[09:13:57] For the next 1 hour(s) and 46 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0900)
[09:13:57] In 4 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400)
[09:14:15] hashar: I quickly backport something
[09:14:20] (CR) Ladsgroup: [C: +2] search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621) (owner: Ladsgroup)
[09:15:50] Amir1: please do ;)
[09:16:16] I am digging in one of the error I have missed yesterday night
[09:16:31] (Merged) jenkins-bot: calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:16:33] (Merged) jenkins-bot: calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:20:30] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:21:09] I think I will block the train
[09:21:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6001.wikimedia.org
[09:21:19] there is an error happening since yesterday and I don't know the impact
[09:21:36] beside that it refers to parsoid which sounds scary
[09:23:08] that is possibly related to what Daniel is doing, I suggest bringing it up in restbase-sunset in slack
[09:23:18] (CR) Elukey: [C: +1] k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:25:09] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:26:52] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet
[09:27:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6001.wikimedia.org
[09:27:36] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:27:46] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:27:55] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:28:06] (Merged) jenkins-bot: search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621) (owner: Ladsgroup)
[09:28:20] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:28:25] Amir1: ah thank you, doing so
[09:28:36] I just did :D
[09:29:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5002.wikimedia.org
[09:30:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp2001.wikimedia.org
[09:31:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:34:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet
[09:35:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp2001.wikimedia.org
[09:37:25] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (Vgutierrez) >>! In T188561#8464256, @DBu-WMF wrote: > @Vgutierrez is there anything left to do so that we can move forward on this task? P...
[09:37:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp1001.wikimedia.org
[09:39:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5002.wikimedia.org
[09:39:50] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1002.eqiad.wmnet
[09:40:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org
[09:41:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp1001.wikimedia.org
[09:45:05] (CR) Jelto: "thanks for the detailed review! I uploaded a new patchset." [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:45:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1002.eqiad.wmnet
[09:46:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org
[09:46:45] !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.14/includes/search/SearchResultThumbnailProvider.php: Backport: [[gerrit:868046|search: Avoid setting height in search thumbnails (T322621)]] (duration: 08m 07s)
[09:46:48] T322621: Use standard thumbsizes in modern vector search - https://phabricator.wikimedia.org/T322621
[09:47:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1003.eqiad.wmnet
[09:53:26] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[09:54:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1003.eqiad.wmnet
[09:54:42] (CR) David Caro: [C: +1] "LGTM 👍" [puppet] - https://gerrit.wikimedia.org/r/867594 (owner: Slyngshede)
[09:55:34] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2001.codfw.wmnet
[09:55:43] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=no; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[09:56:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[09:59:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2001.codfw.wmnet
[10:00:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[10:01:35] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:03:23] (CR) Majavah: [C: -1] base::cloud_production: introduce new profile (5 comments) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:03:26] (CR) Jaime Nuche: [C: +1] "LGTM, thanks again!" [puppet] - https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[10:04:18] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[10:06:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3005.wikimedia.org
[10:07:35] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (aborrero) Resolved→Open a:Cmjohnson→cmooney Reopening until switch changes are made by @cmooney
[10:08:29] (PS5) Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[10:08:49] (CR) CI reject: [V: -1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: Aqu)
[10:10:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3005.wikimedia.org
[10:10:50] (CR) David Caro: [C: +1] base::cloud_production: introduce new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:11:29] (PS6) Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[10:11:48] (CR) CI reject: [V: -1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: Aqu)
[10:12:11] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2002.codfw.wmnet
[10:14:52] (PS1) Kosta Harlan: NewImpact: Add log event for clicking suggested edits button [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041)
[10:16:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2002.codfw.wmnet
[10:17:28] (CR) Jbond: "adding brain who knows the environment better then i" [puppet] - https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[10:17:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet
[10:17:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[10:18:31] (PS1) Jelto: gitlab_runner: add trusted tag to Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069)
[10:19:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2003.codfw.wmnet
[10:20:48] (CR) Jbond: [C: +2] "lgtm merging thanks" [puppet] - https://gerrit.wikimedia.org/r/867579 (owner: Hashar)
[10:21:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet
[10:23:23] (PS1) Jelto: gitlab_runner: add wmcs tag to Shared Runners [puppet] - https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069)
[10:23:41] (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:24:18] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[10:24:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[10:25:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet
[10:25:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2003.codfw.wmnet
[10:28:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet
[10:28:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet
[10:29:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6001.wikimedia.org
[10:30:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet
[10:30:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[10:31:12] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1002.eqiad.wmnet
[10:31:55] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=no; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[10:33:36] (PS1) Jcrespo: icinga: Make the punctuation error optional on check [puppet] - https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169)
[10:33:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6001.wikimedia.org
[10:34:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet
[10:34:57] SRE, Traffic: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (Vgutierrez) Open→In progress After an initial check this seems to be an issue on Varnish, ATS sets `X-Cache-Int` to `miss`: ` vgutierrez@cp6003:~$ curl -H 'Host: upload.wiki...
[10:35:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1002.eqiad.wmnet
[10:35:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5001.wikimedia.org
[10:35:50] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1003.eqiad.wmnet
[10:36:16] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Clement_Goubert) Thanks @Dzahn. We should probably add that final step to the [hardware troubleshooting runbook](https://wikitech.wikimedia.org/wiki/SRE/Dc-ope...
[10:37:07] SRE, Observability-Alerting, observability, Patch-For-Review: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (jcrespo) >>! In T317169#8240929, @Dzahn wrote: > After pondering this a bit more I now think the _actual fix_ would be if Wikipedia and other projec...
[10:39:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1003.eqiad.wmnet [10:40:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5001.wikimedia.org [10:40:11] (03CR) 10Jcrespo: "This should fix https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=en.wikibooks.org&service=Ensure+legal+html+en.wb" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [10:41:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4001.wikimedia.org [10:43:35] (03PS24) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [10:43:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Added documentation to avoid forgetting this step, DC-Ops feel free to revert or ask me to move it elsewhere if you feel it shouldn't be there. 
[10:43:56] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [10:46:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4001.wikimedia.org [10:49:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3001.wikimedia.org [10:50:45] (03CR) 10Volans: "replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:52:04] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [10:54:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3001.wikimedia.org [10:54:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org [10:54:33] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet [10:58:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [10:58:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2003.wikimedia.org [10:59:22] (03CR) 10David Caro: [C: 03+1] base::cloud_production: introduce new profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:59:44] (03PS4) 10Volans: cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) [10:59:48] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:00:12] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet [11:00:13] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host install1003.wikimedia.org [11:00:40] !log installing dpkg bugfix updates from Bullseye point release [11:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:33] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet [11:04:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1003.wikimedia.org [11:09:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet [11:09:15] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet [11:10:04] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet [11:10:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn) [11:11:04] (03PS25) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [11:11:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:17:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet [11:18:28] (03PS1) 10Vgutierrez: varnish: Reproduce T324956 in a VTC test [puppet] - 10https://gerrit.wikimedia.org/r/868043 (https://phabricator.wikimedia.org/T324956) [11:18:47] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet [11:19:11] !log klausman@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=1) for host ml-cache2002.codfw.wmnet [11:20:00] PROBLEM - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is CRITICAL: connect to address 10.192.16.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:20:00] PROBLEM - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:20:02] PROBLEM - cassandra-a service on ml-cache2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:20:02] PROBLEM - Check systemd state on ml-cache2002 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:18] RECOVERY - Check systemd state on ml-cache2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:28] RECOVERY - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is OK: SSL OK - Certificate ml-cache2002-a valid until 2024-06-15 08:50:24 +0000 (expires in 548 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:22:30] RECOVERY - cassandra-a service on ml-cache2002 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:26:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet [11:26:30] RECOVERY - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is OK: TCP OK - 0.032 second response time on 10.192.16.190 port 9042 https://phabricator.wikimedia.org/T93886 [11:26:55] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet [11:29:56] (03PS26) 10Arturo Borrero Gonzalez: cloudlb: introduce 
role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [11:30:18] (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [11:31:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:32:10] 10SRE, 10Traffic, 10Patch-For-Review: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) 05In progress→03Stalled A [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/868043/1/modules/varnish/files/tests/text/33-x-cache-status.v... [11:32:30] (03PS1) 10David Caro: Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436) [11:33:25] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:34:51] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet [11:37:21] (03CR) 10CI reject: [V: 04-1] Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436) (owner: 10David Caro) [11:38:19] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [11:38:25] 
(LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:42:12] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [11:42:24] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [11:46:11] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [11:49:57] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet [11:51:56] (03PS1) 10Marostegui: parsercache.my.cnf.erb: innodb_change_buffering status [puppet] - 10https://gerrit.wikimedia.org/r/868069 [11:52:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10cmooney) Ok the ports are reconfigured now if you want to give it another shot @Andrew [11:52:29] (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf.erb: innodb_change_buffering status [puppet] - 10https://gerrit.wikimedia.org/r/868069 (owner: 10Marostegui) [11:54:46] 10SRE, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10STran) [11:55:28] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [11:55:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38763/console" [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [11:57:27] (03CR) 10Majavah: [V: 03+1 C: 03+1] "looks good, awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [11:58:09] 10SRE, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10STran) I'm aware this is a duplicate of {T290917} but I made it anyway because: - afaik, the the scope of security-api has changed (for now). Whatever's being implemented is for IPInfo's spec... [11:58:47] !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes101[123].eqiad.wmnet [11:59:17] (03PS1) 10Majavah: openstack::haproxy::site: don't provision backend FW rules [puppet] - 10https://gerrit.wikimedia.org/r/868070 [11:59:20] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [11:59:51] (03CR) 10Jbond: [C: 03+1] "lgtm some minor comments questions inline but nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [12:00:33] (03CR) 10Majavah: [V: 03+1 C: 03+1] cloudlb: introduce role skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [12:00:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:07:24] jouncebot: nowandnext [12:07:24] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [12:07:24] In 1 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400) [12:07:33] (03CR) 10Ladsgroup: [C: 03+2] Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 
(https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [12:07:58] (03PS1) 10Majavah: labstore: nfs-mounts: add dumps for qrank [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952) [12:08:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [12:08:40] (03Merged) 10jenkins-bot: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [12:09:13] (03PS1) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [12:09:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38765/console" [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952) (owner: 10Majavah) [12:09:40] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]] [12:09:44] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [12:11:22] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10hnowlan) [12:11:25] !log disable puppet fleet wide to preform server reboots [12:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:44] (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) 
[12:14:05] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet [12:14:23] (03PS2) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [12:18:14] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1001.eqiad.wmnet [12:19:49] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet [12:20:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore: nfs-mounts: add dumps for qrank [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952) (owner: 10Majavah) [12:20:37] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet [12:23:07] (03CR) 10Marostegui: "I would prefer something a bit more meaningful than b1, my first reaction was related to PDUs/racks :)" [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [12:25:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1002.eqiad.wmnet [12:26:51] 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10jbond) p:05Triage→03Medium [12:26:55] (03PS1) 10Volans: config: allow to spcify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 [12:26:57] (03PS1) 10Hnowlan: thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) [12:27:04] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host pki2001.codfw.wmnet [12:27:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - 
https://phabricator.wikimedia.org/T290536 (10Ladsgroup) This is not really user-impacting, specially given that mw-on-k8s is on test2wiki only but I think it should show up in next week's Tech news regardle... [12:28:53] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [12:28:57] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [12:29:48] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1001.eqiad.wmnet [12:30:07] (03CR) 10Clément Goubert: [C: 03+1] thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:30:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet [12:35:04] (03PS7) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [12:35:24] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [12:36:05] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [12:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:37:06] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 
10Hnowlan) [12:38:03] (03PS1) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:38:58] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]] (duration: 29m 18s) [12:39:02] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [12:41:41] (03Merged) 10jenkins-bot: thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:42:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [12:44:38] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [12:47:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [12:51:44] (03PS2) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:53:53] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10akosiaris) >>! In T290536#8466377, @Ladsgroup wrote: > - Inviting tech users to test our the new infra and let us know of issues early on. A related note. 2 th... 
[12:54:01] (03PS8) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [12:54:18] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [12:55:44] (03Abandoned) 10David Caro: Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436) (owner: 10David Caro) [12:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T325154', diff saved to https://phabricator.wikimedia.org/P42692 and previous config saved to /var/cache/conftool/dbconfig/20221214-125544-marostegui.json [12:55:49] T325154: Clean up unix_socket flag in my.cnf - https://phabricator.wikimedia.org/T325154 [12:56:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb1003.eqiad.wmnet [12:56:42] (03PS1) 10Muehlenhoff: Don't install quickstack on Bookworm, revisit later [puppet] - 10https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783) [12:58:10] (03PS2) 10Muehlenhoff: Don't install quickstack on Bookworm, revisit later [puppet] - 10https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783) [12:59:07] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [12:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P42693 and previous config saved to /var/cache/conftool/dbconfig/20221214-125928-root.json [12:59:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 T325154', diff saved to https://phabricator.wikimedia.org/P42694 and previous config saved to /var/cache/conftool/dbconfig/20221214-125950-marostegui.json [13:01:19] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 5%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42695 and previous config saved to /var/cache/conftool/dbconfig/20221214-130119-root.json [13:02:11] (03PS1) 10Marostegui: db_inventory.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868079 (https://phabricator.wikimedia.org/T325154) [13:06:41] (03CR) 10Marostegui: [C: 03+2] db_inventory.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868079 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [13:08:34] (03PS1) 10Hnowlan: thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) [13:09:28] (03PS3) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [13:14:22] (03CR) 10Clément Goubert: [C: 03+1] "I suppose you'll test raising the number of replicas separately?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [13:14:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P42696 and previous config saved to /var/cache/conftool/dbconfig/20221214-131433-root.json [13:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42697 and previous config saved to /var/cache/conftool/dbconfig/20221214-131624-root.json [13:17:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38766/console" [puppet] - 10https://gerrit.wikimedia.org/r/868070 (owner: 10Majavah) [13:27:41] (03CR) 10David Caro: [C: 03+1] "This works for me 👍" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (owner: 10Volans) [13:27:46] (03PS1) 10Marostegui: sanitarium_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868083 (https://phabricator.wikimedia.org/T325154) [13:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P42699 and previous config saved to /var/cache/conftool/dbconfig/20221214-132938-root.json [13:29:41] (03PS1) 10Ladsgroup: Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) [13:29:52] jouncebot: nowandnext [13:29:52] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [13:29:52] In 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400) [13:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After 
testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42700 and previous config saved to /var/cache/conftool/dbconfig/20221214-133129-root.json [13:32:13] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Adding ihurbain to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin) [13:37:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10akosiaris) 05Open→03Resolved I guess all that's left for me as a clinic duty person is to merge the change and resolve the task. Done and done. Thanks everyone! @ihurbain pl... [13:38:05] (03CR) 10Cathal Mooney: [C: 04-1] Example strategy for marking DSCP with ferm and puppet integration (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [13:42:27] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10akosiaris) @Fuzzy, did the updated permissions work out ok? Can we resolve this task? 
[13:44:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P42701 and previous config saved to /var/cache/conftool/dbconfig/20221214-134443-root.json [13:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42702 and previous config saved to /var/cache/conftool/dbconfig/20221214-134634-root.json [13:55:21] (03CR) 10Ladsgroup: [C: 03+2] Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup) [13:59:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P42703 and previous config saved to /var/cache/conftool/dbconfig/20221214-135948-root.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. 
[14:00:10] o/ [14:00:19] (03CR) 10Marostegui: [C: 03+2] sanitarium_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868083 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [14:00:20] yup, looks like nothing to do ^^ [14:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42704 and previous config saved to /var/cache/conftool/dbconfig/20221214-140139-root.json [14:03:51] (03PS1) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 [14:04:26] (03PS2) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainain [alerts] - 10https://gerrit.wikimedia.org/r/868085 [14:04:31] (03PS3) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085 [14:06:18] (03CR) 10CI reject: [V: 04-1] team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup) [14:06:44] (03CR) 10Marostegui: [C: 03+1] team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup) [14:08:15] (03PS4) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 [14:09:39] (03CR) 10Ladsgroup: [C: 03+2] team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup) [14:09:55] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) [14:10:01] (03PS2) 10Volans: config: allow to specify multiple cookbooks paths 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) [14:10:09] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10DBu-WMF) @greg can we move forward and turn click-tracking back on in Acoustic? [14:10:44] (03Merged) 10jenkins-bot: Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup) [14:10:54] (03Merged) 10jenkins-bot: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup) [14:11:02] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) [14:11:18] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) [14:11:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup) [14:11:45] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs1003.eqiad.wmnet [14:11:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs1003.eqiad.wmnet [14:11:52] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]] [14:11:55] T325137: UnexpectedValueException: Invalid version string "" - https://phabricator.wikimedia.org/T325137 [14:13:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search 
(Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10bking) Removed downtime and repooled WCQS as it sounds like reseating the hard drives may have fixed it. @Jclark-ctr let us know if you hear anythi... [14:13:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [14:13:43] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:14:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P42705 and previous config saved to /var/cache/conftool/dbconfig/20221214-141453-root.json [14:15:27] (03PS1) 10Volans: cookbooks: remote top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) [14:15:28] (03PS1) 10Volans: cookboos.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 [14:15:45] (03PS2) 10Volans: cookbooks: remote top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) [14:16:41] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [14:16:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42706 and previous config saved to /var/cache/conftool/dbconfig/20221214-141644-root.json [14:17:02] (03CR) 10Volans: [C: 04-1] "-1 for now, depends on the Spicerack release with the related patch" [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 
(https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [14:17:07] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) 05Open→03In progress p:05Triage→03Medium [14:17:34] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) a:03Volans [14:17:44] (03PS9) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [14:19:32] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38767/console" [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [14:20:03] (03PS10) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [14:20:04] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]] (duration: 08m 12s) [14:20:08] T325137: UnexpectedValueException: Invalid version string "" - https://phabricator.wikimedia.org/T325137 [14:20:26] (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [14:21:52] (03CR) 10Clément Goubert: [C: 03+1] docker_registry_ha: add contint2002 to image builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/867708 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [14:22:28] (03CR) 10Elukey: [C: 03+2] kserve-inference: fix dependencies in Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/867600 (https://phabricator.wikimedia.org/T303279) (owner: 
10Elukey) [14:22:58] (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [14:24:10] (03CR) 10Clément Goubert: [C: 03+1] logstash: heavily restrict mediawiki http accesslog during initial onboarding [puppet] - 10https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: 10Cwhite) [14:28:45] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche) [14:28:47] (03PS3) 10FNegri: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [14:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P42707 and previous config saved to /var/cache/conftool/dbconfig/20221214-142958-root.json [14:30:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb1003.eqiad.wmnet [14:38:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10Jclark-ctr) 05Open→03Resolved [14:44:10] (03PS11) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [14:44:29] (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu) [14:44:48] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10SCherukuwada) Apologies, I don't have access to wikisource. @mpopov does probably. 
[14:44:52] (03PS4) 10Volans: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) [14:44:54] (03PS2) 10Volans: cookboos.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 [14:45:24] (03CR) 10Ottomata: [C: 03+2] Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [14:46:07] (03PS12) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [14:47:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host xhgui2001.codfw.wmnet [14:50:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui2001.codfw.wmnet [14:51:54] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:52:04] (03PS13) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) [14:52:34] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit, reduce workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:54:32] 10SRE, 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10decommission-hardware, 10serviceops-collab: decommission contint1001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T325102 (10Jclark-ctr) [14:54:40] 10SRE, 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10decommission-hardware, 10serviceops-collab: decommission contint1001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T325102 (10Jclark-ctr) 05Open→03Resolved [14:54:41] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host xhgui1001.eqiad.wmnet [14:55:10] (03PS3) 10JMeybohm: k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) [14:55:14] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1005.eqiad.wmnet [14:55:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add Sondes Ben Chagra to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: 10JHathaway) [14:55:58] (03PS2) 10Alexandros Kosiaris: Add Sondes Ben Chagra to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: 10JHathaway) [14:56:27] (03Merged) 10jenkins-bot: thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:57:01] (03CR) 10JMeybohm: [C: 03+1] Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [14:57:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1004.eqiad.wmnet [14:57:43] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [14:58:06] (03CR) 10JMeybohm: [C: 03+2] k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) (owner: 10JMeybohm) [14:58:25] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Planning, 10LDAP-Access-Requests, and 2 others: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (10akosiaris) 05Open→03Resolved a:03akosiaris Thanks @jhathaway. user has been added to the WMF group. Resolvi... 
[14:58:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui1001.eqiad.wmnet [14:59:27] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) With {T270191} I've changed the zone of k8s ganeti workers to to their respective ganeti cluster and g... [15:00:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:26] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10akosiaris) Hello @sbassett, I see we are still missing some input here, any updates? [15:00:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1006.eqiad.wmnet [15:00:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1005.eqiad.wmnet [15:01:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:03:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10akosiaris) 05Stalled→03Invalid Since there are no updates on this task and it pretty much appears to be a duplicate of T324057, I 'll resolve as `invalid` (not mergin... 
[15:06:01] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) 05Open→03In progress [15:07:15] (03PS1) 10FNegri: Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 [15:07:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:07:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1006.eqiad.wmnet [15:08:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1007.eqiad.wmnet [15:09:02] (03CR) 10FNegri: [C: 04-2] "DO NOT MERGE. Will be pushed to the new Git repo." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri) [15:10:04] (03CR) 10CI reject: [V: 04-1] Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri) [15:13:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10akosiaris) Hi @VirginiaPoundstone, which username do you use to log in to turnilo? [15:14:00] 10SRE, 10Traffic, 10Patch-For-Review: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) I've updated wikitech https://wikitech.wikimedia.org/w/index.php?title=Caching_overview&diff=2040756&oldid=2029875 to reflect that both X-Cache `hi... 
[15:14:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2001.codfw.wmnet [15:15:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:15:26] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [15:16:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1007.eqiad.wmnet [15:17:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1008.eqiad.wmnet [15:17:21] 10SRE, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10LSobanski) [15:17:32] 10SRE, 10serviceops, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10LSobanski) [15:18:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED [15:18:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2001.codfw.wmnet [15:19:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet [15:20:04] 10SRE, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Platform Engineering (Needs Cleaning - Cassandra Operational): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329 (10LSobanski) [15:20:07] 10SRE, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review: Automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355 (10LSobanski) [15:20:40] 10SRE, 10Patch-For-Review, 10Platform Engineering (Icebox): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10LSobanski) 05Open→03Resolved a:03LSobanski I talked to Eric, this is no longer relevant. 
[15:21:59] 10SRE, 10serviceops, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10jijiki) 05Open→03Resolved a:03jijiki Given that this task was opened when the infra was completely different, I am bluntly closing this task. I am happy to re-open if/w... [15:23:03] (03PS1) 10Alexandros Kosiaris: admin: Add mnz to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/868098 (https://phabricator.wikimedia.org/T325072) [15:23:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1008.eqiad.wmnet [15:24:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet [15:24:22] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1009.eqiad.wmnet [15:24:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2003.codfw.wmnet [15:25:30] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:26:28] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10akosiaris) @Wangombe couple you please respond to the above comment? Many thanks! [15:26:49] 10SRE, 10Cassandra, 10Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10Eevans) [15:27:01] 10SRE, 10WMF-General-or-Unknown, 10Sustainability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705 (10LSobanski) 05Open→03Declined I'm closing this as Declined. Given its age and the changes in Restbase it likely needs a new problem statement befor... [15:28:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2003.codfw.wmnet [15:30:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We really need to find a way to generate the egress list programmatically. Thanks for this fix though!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:30:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1009.eqiad.wmnet [15:30:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:32:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED [15:32:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org [15:33:26] 10SRE: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (10LSobanski) 05Open→03Resolved a:03LSobanski Considering the age of this task, we're probably safe to close it. Please reopen if you think otherwise. [15:34:32] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [15:34:44] (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:36:43] (03PS1) 10Cmjohnson: Adding logstash1036-37 to site.pp and netboot cfg [puppet] - 10https://gerrit.wikimedia.org/r/868107 (https://phabricator.wikimedia.org/T313849) [15:38:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org [15:39:27] (03CR) 10Cmjohnson: [C: 03+2] Adding logstash1036-37 to site.pp and netboot cfg [puppet] - 10https://gerrit.wikimedia.org/r/868107 (https://phabricator.wikimedia.org/T313849) (owner: 10Cmjohnson) [15:40:37] (03PS1) 10Clément Goubert: wmnet: Add aux-k8s-ingress.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178) [15:41:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): 
Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) a:05cmooney→03Andrew [15:41:54] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Regarding DNS updates, I am going to paste the comment I linked to in my last comment below so all of the information is in t... [15:43:56] PROBLEM - Check systemd state on apt2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:18] (03CR) 10Eevans: [C: 03+2] Migrate echostore & sessionstore staging to new cassandra-dev cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [15:44:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [15:44:58] (03CR) 10JMeybohm: [C: 04-1] "Looks good so far. 
I don't really like that it's called flink-*kubernetes*-operator because that's very obvious at this point, but probabl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:45:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:45:41] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:46:55] (03PS5) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 [15:47:07] (03CR) 10JMeybohm: [C: 04-1] flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:48:16] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:48:23] (03PS2) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625 [15:48:40] (03CR) 10Andrew Bogott: puppetmasters: cache cleanup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [15:48:56] (03CR) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [15:49:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:49:28] RECOVERY - Check systemd state on apt2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:33] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett) [15:50:06] (03Merged) 10jenkins-bot: Migrate echostore 
& sessionstore staging to new cassandra-dev cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [15:50:12] (03CR) 10Volans: [C: 03+2] config: allow to specify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:50:41] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:50:49] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [15:51:13] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [15:51:37] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett) >>! In T323943#8466964, @akosiaris wrote: > Hello @sbassett, I see we are still missing some input here, any upda... [15:52:40] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:53:43] (03Merged) 10jenkins-bot: config: allow to specify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:54:05] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [15:54:14] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:55:28] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) As an update, we're hopeful of having the work for T299125 done by end of this quarter (with deployment early next now, given how close we are to the no-change 
window); that... [15:55:38] (03CR) 10Marostegui: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:58:28] (03PS1) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) [15:58:42] (03CR) 10Jbond: "thanks, and ping when ever for help with the puppet stuff" [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [15:59:04] (03CR) 10JHathaway: [V: 03+1] "looks good, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [16:00:39] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:00:46] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:02:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS buster [16:02:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster [16:02:23] (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:03:24] (03Abandoned) 10FNegri: Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri) [16:04:49] (03CR) 
10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [16:05:41] (03CR) 10Majavah: [C: 04-1] Added some comments about where/how cloud hiera settings are applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [16:06:36] (03PS1) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) [16:06:52] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:10:26] (03CR) 10Volans: [C: 03+2] cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [16:10:32] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1002.eqiad.wmnet [16:11:13] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [16:15:34] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1002.eqiad.wmnet [16:16:01] (CirrusSearchJobQueueBacklogTooBig) firing: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 1.63M messages - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [16:16:34] (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:16:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:02] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:17:57] !log 
eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:19:20] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon2002.codfw.wmnet [16:19:59] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) I've created a CR https://gerrit.wikimedia.org/r/c/operations/dns/+/868103 to add the DNS records to the wikimedia.org DNS zone... [16:21:01] (CirrusSearchJobQueueBacklogTooBig) resolved: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 464k messages - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [16:21:34] (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:22:21] I will promote group 1 wikis to 1.40.0-wmf.14 in a few minutes (at 16:30 UTC) [16:22:44] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:23:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:23:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [16:24:15] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2002.codfw.wmnet [16:25:15] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1001.eqiad.wmnet [16:25:56] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:27:25] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2028'] [16:28:22] PROBLEM - OpenSearch health check for shards 
on 9200 on logstash2028 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc9d1c82278: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [16:28:22] org/wiki/Search%23Administration [16:29:26] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:30:04] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:31:34] (KeyholderUnarmed) resolved: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:31:54] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:32:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:33:08] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:40] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1001.eqiad.wmnet [16:34:15] I am promoting group 1 wikis now [16:34:19] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519) [16:34:21] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot) [16:35:09] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.14 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot) [16:35:24] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2028'] [16:36:52] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host dispatch-be1001.eqiad.wmnet [16:36:54] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2028'] [16:38:37] (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:40:24] PROBLEM - Host logstash2028 is DOWN: PING CRITICAL - Packet loss = 100% [16:40:35] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dispatch-be1001.eqiad.wmnet [16:41:56] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet [16:41:59] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet [16:42:35] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:42:59] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet [16:43:12] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.14 refs T320519 [16:43:13] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host orespoolcounter1003.eqiad.wmnet [16:43:15] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519 [16:43:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet [16:43:43] (03CR) 10Marostegui: mariadb: Setup new 
definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:43:56] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash2028'] [16:44:06] RECOVERY - Host logstash2028 is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [16:44:11] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10fnegri) 05Open→03Resolved [16:44:16] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:44:22] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10fnegri) 05Open→03Resolved [16:44:32] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:44:58] RECOVERY - OpenSearch health check for shards on 9200 on logstash2028 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1464, relocating_shards: 4, initializing_shards: 0, unassigned_shards: 1, delayed_unassigned_sha [16:44:58] number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.93174061433447 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:45:59] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:46:15] (03PS1) 
10Eevans: sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113) [16:46:26] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10nskaggs) As someone without global root who has been a test case in the past for this, allowing wmcs* cookbook runs for a subset of user... [16:46:27] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2009.codfw.wmnet [16:47:16] (03Abandoned) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [16:47:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1003.eqiad.wmnet [16:47:51] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:47:59] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:48:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet [16:48:15] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2028.codfw.wmnet with OS bullseye [16:48:30] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet [16:48:48] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet [16:49:19] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [16:49:23] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: 3d2png failing in Kubernetes - 
https://phabricator.wikimedia.org/T323936 (10hnowlan) 05Open→03Resolved [16:49:41] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet [16:50:19] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.14 refs T320519 (duration: 07m 06s) [16:50:22] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519 [16:50:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:51:29] hnowlan: ^^ [16:51:29] (03PS1) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 [16:51:37] (03CR) 10CI reject: [V: 04-1] Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (owner: 10Daniel Kinzler) [16:52:20] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1004.eqiad.wmnet [16:52:22] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2009.codfw.wmnet [16:52:30] jayme: ack, thanks [16:52:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [16:53:23] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet [16:55:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet [16:55:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1004.eqiad.wmnet [16:56:38] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet [16:57:55] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus4001.ulsfo.wmnet [16:58:05] !log hnowlan@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/thumbor: sync [16:58:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1036.eqiad.wmnet with OS buster [16:58:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with e... [16:58:37] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet [16:59:26] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2003.codfw.wmnet [17:01:57] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2029'] [17:02:25] PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:03:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2003.codfw.wmnet [17:03:53] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4001.ulsfo.wmnet [17:04:49] (03CR) 10JHathaway: [V: 03+1] "looks good to me, though I would like someone on traffic to weigh in, or someone with more familiarity on our ingress setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [17:05:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet [17:05:37] RECOVERY - Checks that the airflow database for airflow research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:06:03] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Add aux-k8s-ingress.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [17:06:38] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2004.codfw.wmnet [17:06:44] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet [17:07:27] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2008.codfw.wmnet [17:07:41] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet [17:08:44] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [17:08:46] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2028.codfw.wmnet with reason: host reimage [17:08:48] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus5001.eqsin.wmnet [17:08:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2029'] [17:08:58] !log planet2002 - rebooting [17:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] !log planet1002 - rebooting [17:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
orespoolcounter2004.codfw.wmnet [17:10:49] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add aux-k8s-ingress VIP - cgoubert@cumin1001" [17:11:43] !log doc2001 - rebooting [17:11:43] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2029'] [17:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2028.codfw.wmnet with reason: host reimage [17:11:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add aux-k8s-ingress VIP - cgoubert@cumin1001" [17:11:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:58] !log https://doc.wikimedia.org - maybe a few seconds of downtime [17:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [17:13:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes1011.eqiad.wmnet [17:13:55] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet [17:14:09] PROBLEM - Host logstash2029 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:02] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10Volans) Indeed, I agree that we might need later on some more fine-tuned way to authorize things. That said the new cloudcumin setup wil... 
[17:15:13] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5001.eqsin.wmnet [17:16:26] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet [17:17:02] (03CR) 10Eevans: [C: 03+2] sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [17:18:30] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2029'] [17:18:43] RECOVERY - Host logstash2029 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [17:18:56] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus6001.drmrs.wmnet [17:19:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes1011.eqiad.wmnet [17:19:01] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2029.codfw.wmnet with OS bullseye [17:21:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet [17:21:32] (03Merged) 10jenkins-bot: sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [17:22:00] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet [17:22:13] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [17:22:39] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [17:24:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:25:07] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
prometheus6001.drmrs.wmnet [17:27:29] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet [17:28:16] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet [17:33:47] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [17:34:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2028.codfw.wmnet with OS bullseye [17:38:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [17:38:16] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2029.codfw.wmnet with reason: host reimage [17:41:25] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2029.codfw.wmnet with reason: host reimage [17:41:56] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/868139 [17:42:13] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I looked at the dumps related file and from the PCC run, it looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli) [17:44:57] (03CR) 10Andrew Bogott: [C: 03+1] scap: disable git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [17:45:07] !log disable puppet on all P:mediawiki::nutcracker hosts (killing nutcracker on mw) [17:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:43] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/868139 (owner: 10Volans) [17:48:24] (03CR) 10Majavah: [C: 04-1] "the code to manage the config file should be in the same place where the package is installed, imo" [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:48:59] (03PS5) 10Effie Mouzeli: mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) [17:49:11] (03PS6) 10Effie Mouzeli: mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) [17:49:13] (03CR) 10David Caro: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:49:46] effie: \o/ [17:49:50] kill kill kill [17:49:54] hehe [17:49:55] (03PS1) 10Volans: Upstream release v6.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/868142 [17:50:41] (03CR) 10Majavah: [C: 04-1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:51:43] jouncebot: now [17:51:43] No deployments scheduled for the next 1 hour(s) and 8 
minute(s) [17:52:09] (03CR) 10David Caro: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:52:19] duesen: which group are you targeting with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/868136 ? [17:52:22] thcipriani: o/ [17:52:37] current lay of the land: https://versions.toolforge.org/ [17:52:39] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli) [17:53:47] (03CR) 10Majavah: [C: 04-1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:54:34] effie: congratulations on removing nutcracker, that's been around forever [17:54:44] must be a milestone as well [17:54:57] oh! nice, kudos :) [17:56:08] (03CR) 10Daniel Kinzler: [C: 03+2] "for deploy" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [17:56:40] (03CR) 10Volans: [C: 03+2] Upstream release v6.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/868142 (owner: 10Volans) [17:56:51] thcipriani: it'll be 30 minutes until this merges I suppose --^^ [17:57:09] I guess we won't make it in time [17:57:24] duesen: bah, yeah, I guess that's true. Let's meet back here after and get it out? [17:58:10] I'm meeting with Subbu after... but I can verify on the side :) [17:58:21] It's not super urgent. It would just be nice to see that it works. [17:58:40] I guess we should deploy it when it's merged into the branch... [18:00:25] works for me. I'll keep an eye on it and ping you. 
[18:00:36] cool, thanks [18:01:09] !log uploaded spicerack_6.0.0 to apt.wikimedia.org bullseye-wikimedia [18:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:10] duesen: well. Looks like a lot of test failures :\ [18:03:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2029.codfw.wmnet with OS bullseye [18:04:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [18:06:00] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2027'] [18:06:22] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Hi all, I apologize for the latency, I will be working on this today. Thanks teammates, Kelton Hurd Wikimedia... [18:07:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [18:08:18] PROBLEM - OpenSearch health check for shards on 9200 on logstash2027 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0217406278: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [18:08:18] org/wiki/Search%23Administration [18:08:38] (03CR) 10CI reject: [V: 04-1] Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [18:09:53] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2001.codfw.wmnet with OS bullseye [18:11:30] RECOVERY - OpenSearch health check for shards on 9200 on logstash2027 
is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 9, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1456, relocating_shards: 20, initializing_shards: 0, unassigned_shards: 9, delayed_unassigned_sha [18:11:30] number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.38566552901024 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:12:46] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2027'] [18:14:22] PROBLEM - nutcracker process on mw1447 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:14:32] PROBLEM - nutcracker socket on parse2002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:14:34] please ignore the nutcracker alerts [18:14:36] PROBLEM - nutcracker socket on parse2001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:14:42] PROBLEM - nutcracker socket on mwdebug1002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:14:42] they will be cleared soon [18:14:46] PROBLEM - nutcracker socket on mw1447 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:02] PROBLEM - nutcracker socket on mw1448 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:02] PROBLEM - nutcracker 
process on parse1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:04] PROBLEM - nutcracker process on mw1449 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:06] PROBLEM - nutcracker process on mwdebug1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:08] PROBLEM - nutcracker process on mw2271 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:08] PROBLEM - nutcracker process on mw1448 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:10] PROBLEM - nutcracker socket on mw2272 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:14] PROBLEM - nutcracker process on parse1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:14] PROBLEM - nutcracker socket on mw2374 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:16] PROBLEM - nutcracker socket on mw2271 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:16] PROBLEM - nutcracker socket on parse1001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:16] PROBLEM - nutcracker process on parse2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 
(nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:20] PROBLEM - nutcracker socket on mwdebug2002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:28] PROBLEM - nutcracker socket on mw2376 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:15:28] PROBLEM - nutcracker socket on mwdebug2001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [18:16:28] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2027'] [18:16:28] (03PS2) 10Daniel Kinzler: Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 [18:19:20] PROBLEM - Host logstash2027 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:53] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2027'] [18:23:18] (03PS1) 10Eevans: echostore: Tighten egress to explit host/port list [deployment-charts] - 10https://gerrit.wikimedia.org/r/868146 [18:23:30] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2027.codfw.wmnet with OS bullseye [18:27:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:28:04] (03CR) 10Eevans: "Seeing the full list of nodes for RESTBase (our largest cluster) makes me wish that we allocated subnets for clusters like this. 
😢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868146 (owner: 10Eevans) [18:28:36] (03PS2) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) [18:29:17] (03CR) 10Ssingh: [C: 03+1] wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) (owner: 10Vgutierrez) [18:30:25] (03PS3) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) [18:31:58] (03CR) 10Vgutierrez: [C: 03+2] wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) (owner: 10Vgutierrez) [18:32:35] (03PS2) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) [18:35:10] (03PS2) 10Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) [18:35:16] PROBLEM - Check systemd state on parse1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:47] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) DNS records are now live: `$ host -t cname links.email.wikimedia.org links.email.wikimedia.org is an alias for recp.mkt41.net.`... 
[18:36:28] (03CR) 10Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [18:37:00] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2001.codfw.wmnet with reason: host reimage [18:37:10] (03PS1) 10Effie Mouzeli: cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 [18:37:45] (03PS3) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) [18:38:51] (03CR) 10CI reject: [V: 04-1] cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 (owner: 10Effie Mouzeli) [18:39:02] PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:58] (03PS4) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) [18:40:13] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2001.codfw.wmnet with reason: host reimage [18:40:55] (03PS1) 10Southparkfan: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) [18:41:42] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38779/console" [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [18:42:41] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 
on logstash2027.codfw.wmnet with reason: host reimage [18:45:49] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2027.codfw.wmnet with reason: host reimage [18:48:57] (03CR) 10Dzahn: [C: 03+1] gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [18:49:32] (03PS2) 10Effie Mouzeli: cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 [18:49:38] (03CR) 10Dzahn: [C: 03+1] "I like it, I have been confused by "trusted vs protected" myself in the past" [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [18:50:25] (03PS3) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [18:50:47] (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:51:00] (03PS4) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [18:52:10] (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:53:36] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Thank you, @Vgutierrez. I've alerted Acoustic support that the updates have been made and I will follow up here when they pro... 
[18:53:55] (03CR) 10Dzahn: "I am not sure, I feel like this might open it up to more changes like this in the future. Once the projects start drifting more we'd repea" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [18:57:50] thcipriani: I scheduled the fix for the regular backport window in two hours. [18:59:38] (03CR) 10Dzahn: [C: 03+1] "yea, merge it but I think it also needs some social contract how to deal with it in the future" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [19:00:05] hashar and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1900). [19:00:05] hashar and ^demon: Your horoscope predicts another unfortunate MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1900). 
[19:01:45] (03CR) 10Dzahn: [C: 03+1] "either way I think we should not keep this alert unless there is coordination between the 3 involved stakeholders, SRE/observability, lega" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [19:02:47] PROBLEM - Check systemd state on mw1449 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:38] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2027.codfw.wmnet with OS bullseye [19:11:39] PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 667 MB (3% inode=91%): /tmp 667 MB (3% inode=91%): /var/tmp 667 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [19:12:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1037.eqiad.wmnet with OS buster [19:12:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster [19:14:24] (03CR) 10Dzahn: [C: 03+2] scap: add stanza for jenkins-ci and jenkins-releases deploy [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:15:14] sigh, I will look at disk space on aphlict1001 [19:15:17] re: alert above [19:15:18] (03CR) 10Effie Mouzeli: [C: 03+2] cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 (owner: 10Effie Mouzeli) [19:15:23] mutante: was just about to ping you to look [19:15:44] RhinosF1: :) 
thanks [19:16:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Cmjohnson) @Jclark-ctr I am getting a media test failure for logstash1037, can you check the cable please logstash1037 F1 U26 Port 26 [19:16:41] mutante: see https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad%20prometheus%2Fops&orgId=1&viewPanel=28&from=now-7d&to=now, big spike yesterday [19:16:47] Maybe logs again? [19:17:07] Or a failed rotate [19:17:38] Similar spike 1st/2nd [19:17:41] RhinosF1: yes, it is [19:19:53] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash1037.eqiad.wmnet with OS buster [19:19:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster executed with e... [19:21:02] !log aphlict1001 - :/var/log/aphlict# gzip aphlict.log.1 [19:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:32] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Hi Alexandros. The permissions for he.wikisource.org were updated, but not for he.m.wikisource.org. Thanks. 
[19:27:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS buster [19:27:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster [19:30:52] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2001.codfw.wmnet with OS bullseye [19:30:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Cmjohnson) Jclark-ctr I am also getting a media test failure on logstash1036, the DAC cable may be plugged into the wrong port. [19:32:25] RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [19:33:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @jclark-ctr Can you try reseating the nic if that is possible [19:43:25] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:46:10] (03PS1) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [19:46:29] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 
10Cathal Mooney) [19:54:18] (03PS2) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [19:54:39] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [19:56:22] (03PS3) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [19:58:12] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [19:59:55] (03PS1) 10Gergő Tisza: UserEditTracker: Allow querying primary DB for edit timestamp [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051 [20:00:17] (03PS1) 10Gergő Tisza: User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930) [20:00:35] (03PS4) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [20:04:30] (03PS3) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625 [20:04:32] (03PS1) 10Andrew Bogott: Add wmcs::openstack::eqiad1::virt_ceph to new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868159 (https://phabricator.wikimedia.org/T313983) [20:05:23] (03CR) 10Andrew Bogott: [C: 03+2] Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [20:06:10] (03CR) 10Andrew Bogott: 
[C: 03+2] Add wmcs::openstack::eqiad1::virt_ceph to new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868159 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott) [20:06:31] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:11:10] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash1036.eqiad.wmnet with OS buster [20:11:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with e... [20:12:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [20:13:55] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [20:14:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye [20:14:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn... 
[20:17:13] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1054.eqiad.wmnet with OS bullseye [20:20:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet w... [20:29:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye [20:29:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn... [20:31:29] (03PS5) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [20:32:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) I enabled virtualization in the bios processor settings for each of these hosts. 
[20:33:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye [20:33:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bullseye [20:33:36] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bullseye [20:33:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet with OS bullseye [20:36:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye [20:36:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS bullseye [20:36:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet with OS bullseye [20:36:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet with OS bullseye [20:38:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bullseye [20:38:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bullseye [20:38:53] !log 
andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bullseye [20:38:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1059.eqiad.wmnet with OS bullseye [20:39:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1060.eqiad.wmnet with OS bullseye [20:39:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet with OS bullseye [20:42:00] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [20:42:53] (03PS6) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [20:45:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [20:46:04] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [20:46:22] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [20:47:57] (03PS7) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) 
[20:49:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [20:49:28] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [20:49:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [20:51:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [20:51:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [20:51:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage [20:51:48] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1054.eqiad.wmnet with OS bullseye [20:51:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet with OS bullseye execut... [20:51:57] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [20:53:35] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1055.eqiad.wmnet with OS bullseye [20:53:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bullseye execut... 
[20:54:02] (03PS1) 10Andrew Bogott: Add hiera host defs for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868166 (https://phabricator.wikimedia.org/T313983) [20:54:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [20:54:29] (03CR) 10Andrew Bogott: [C: 03+2] Add hiera host defs for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868166 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott) [20:55:19] (03PS8) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [20:55:39] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [20:56:05] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye [20:56:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye [20:56:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmn... [20:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn... 
[20:56:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage [20:57:29] (03PS9) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [20:58:50] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [20:58:51] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [20:59:30] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T2100). [21:00:04] duesen, subbu, kemayo, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:21] o/ [21:00:33] * TheresNoTime can deploy, are there any self-service patches? [21:00:42] o/ [21:00:52] o/ I can self-service, I added more patches than allowed [21:01:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [21:01:30] tgr: sure, I'll ping you when ready? 
[21:01:51] subbu: will start with yours [21:01:57] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:02:42] Mine do collapse down okay - it's the same patch applied to .13 and .14, and a config patch that makes it actually do something. [21:02:43] sure. ty [21:03:18] (03PS1) 10Andrew Bogott: OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 [21:05:20] (03PS10) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [21:05:36] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [21:06:04] (03PS2) 10Andrew Bogott: OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 [21:06:58] (03PS11) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [21:07:32] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/868167/38790/" [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott) [21:07:47] subbu: I'm helping my daughter with homework (learning sql, yay). So I'm around if anything comes up. [21:08:00] sounds good! 
:-) [21:09:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [21:09:52] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [21:12:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [21:14:03] subbu: your patch failed https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/11930/console#console-section-16 [21:14:25] Yes, i saw .. can you retry? [21:14:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [21:15:04] I am not sure why selenium tests would fail here .. i expect it is something transient. [21:15:18] subbu: ack will do [21:15:21] ty [21:16:20] subbu: if it's just that, worth forcing it through with a V+2? [21:16:24] (I think that works?) [21:16:59] TheresNoTime: why not recheck? [21:17:01] no, let us retry .. just in case. 
[21:17:10] sure :) [21:17:38] !log samtar@deploy1002 backport aborted: (duration: 15m 35s) [21:18:18] (03PS12) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [21:18:21] (03CR) 10Samtar: [C: 03+2] "deploy, retry" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:18:38] (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [21:19:05] (03CR) 10CI reject: [V: 04-1] Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:19:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1056.eqiad.wmnet with OS bullseye [21:19:25] PROBLEM - ensure kvm processes are running on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:19:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet w... [21:19:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet w... 
[21:19:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1057.eqiad.wmnet with OS bullseye [21:19:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet w... [21:19:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet w... [21:19:50] (03CR) 10Samtar: [C: 03+2] "recheck" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:20:23] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:20:25] (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:20:27] subbu: it was a growth experiments test that failed [21:20:36] (03PS13) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) [21:21:24] tgr: in case it fails again, might be worth you looking ^ or do you have a flappy test? 
[21:22:01] PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:03] Kemayo: while that is retrying, I can do yours if you're available [21:22:11] TheresNoTime: Sure thing [21:22:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS bullseye [21:22:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet w... [21:23:03] Error: element (".oo-ui-messageDialog-message") still not displayed after 5000ms [21:23:07] All selenium tests are flappy. Those weren't GrowthExperiments tests though. [21:23:28] tgr: /workspace/src/extensions/GrowthExperiments/node_modules/webdriverio/build/commands/browser/waitUntil.js:66:23 is the path it gave [21:23:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS bullseye [21:23:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet w... 
[21:23:49] It failed on the Growth Experiments step [21:24:09] And all flappy sounds fun [21:24:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bullseye [21:24:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1059.eqiad.wmnet w... [21:26:05] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Security: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813 (10LSobanski) [21:26:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bullseye [21:27:19] (03CR) 10Cathal Mooney: [C: 04-1] "Not to be merged just an example." [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [21:27:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1060.eqiad.wmnet w... 
[21:29:37] Kemayo: oh maybe not, it's queued behind that core patch (: sorry [21:29:52] 🥲 [21:32:02] * TheresNoTime picked a great window to record with https://github.com/faressoft/terminalizer for docs (: [21:35:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bullseye [21:35:53] PROBLEM - Check systemd state on mw1418 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet w... [21:36:55] This is the test that failed (VE toolbar special characters button): [21:36:58] https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/11930/artifact/log/Toolbar-should-open-special-characters-menu-2022-12-14T21-11-48-713Z.mp4 [21:37:11] the recording is not very informative though [21:37:51] PROBLEM - ensure kvm processes are running on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:38:40] Given that Daniel's patch is supposed to improve VE latencies ... that failure may potentially be pertinent ... but I expect it is just a flappy failure in reality. 
[21:39:00] it doesn't look like opening the special characters menu does a backend request, so it's probably unrelated [21:39:15] (but yeah VE failing on a VE fix patch is a bit scary) [21:39:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bullseye [21:39:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet w... [21:39:30] RECOVERY - ensure kvm processes are running on cloudvirt1054 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:40:06] tgr but this is a cherry pick ... the original patch merged ... so there is also that. [21:40:09] failed selenium tests are repeated once, so it shouldn't be *that* flappy [21:40:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) @jbond I've uploaded a separate patch (above) that makes a stab at working this closer to how we discussed earlier. It defi...
[21:40:21] ok [21:40:39] RECOVERY - ensure kvm processes are running on cloudvirt1058 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:40:56] well, I guess it's easy enough to check that button on mwdebug [21:41:32] (03Merged) 10jenkins-bot: Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:41:35] (03Merged) 10jenkins-bot: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:41:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler) [21:41:39] well, it merged i think. :) [21:41:50] * subbu was watching zuul [21:42:06] * duesen curses at Selenium [21:42:17] !log samtar@deploy1002 Started scap: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]] [21:44:05] !log samtar@deploy1002 samtar and daniel: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:44:10] subbu: live on mwdebug ^ [21:44:22] ok .. will test. [21:45:10] let me check the GrowthExperiments test (that failed once, passed on auto-retry) [21:46:43] VE is still functional .. so looks good to continue. 
[21:47:09] eh, I don't think GE image recommendations are set up on any wmf.14 wiki [21:47:14] subbu: ack, tgr are you testing something or is that separate from this [21:47:23] wanted to, but can't [21:47:32] okay, will sync [21:47:36] will double-check tomorrow just in case [21:48:08] (but the test passed 3 out of times so it's very likely the usual selenium timing thing) [21:48:19] ...out of 4... [21:48:38] Ack [21:48:39] (03Merged) 10jenkins-bot: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:51:41] Kemayo: after this I'll roll 619/620 into one deploy and then do the config patch — sound okay? [21:51:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) 05Open→03Resolved I've imaged all these servers and put 1054 in the 'ceph' pool and the others in the 'spare'... [21:51:51] Sounds good to me. [21:53:30] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]] (duration: 11m 13s) [21:53:33] subbu: live in prod [21:53:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:53:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:53:50] ty! duesen ^ if you want to test your hebrew skills. 
:) [21:54:14] !log samtar@deploy1002 Started scap: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]] [21:54:17] T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537 [21:55:04] Kemayo: those two are live on mwdebug [21:55:32] But not the config patch yet, right? [21:55:48] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) Thanks @Vgutierrez ! [21:56:00] yes true, so can't be tested yet I suppose? [21:56:02] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:56:59] TheresNoTime: I've verified that it's somewhat working via `uselang` elsewhere, so it's looking promising. [21:57:11] will sync :) [21:57:32] (03PS5) 10Samtar: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [21:59:29] * subbu is signing off for a bit and back online in about 15 mins.
[21:59:34] o/ [22:03:09] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]] (duration: 08m 55s) [22:03:13] T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537 [22:03:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [22:04:08] (03Merged) 10jenkins-bot: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch) [22:04:38] !log samtar@deploy1002 Started scap: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]] [22:06:23] !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:06:25] Kemayo: okay, config patch is on mwdebug [22:06:55] TheresNoTime: Looks good! [22:07:03] cool, syncing :) [22:12:51] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]] (duration: 08m 12s) [22:12:54] Kemayo: live in prod [22:12:55] T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537 [22:13:02] tgr: all yours if you still want to [22:13:16] thanks! [22:13:25] TheresNoTime: Thanks! [22:13:36] duesen, the patch may have done the trick ... 
the latency is now in the much lower range ... will probably be clearer by tomorrow. [22:15:23] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) Related: {T148048} [22:16:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041) (owner: 10Kosta Harlan) [22:16:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051 (owner: 10Gergő Tisza) [22:24:04] (03PS1) 10BryanDavis: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T286164) [22:24:35] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:28:45] (03PS2) 10BryanDavis: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681) [22:32:39] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert1001.wikimedia.org [22:36:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: NFS troubleshooting [22:36:05] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: NFS troubleshooting [22:36:54] (03Merged) 10jenkins-bot: NewImpact: Add log 
event for clicking suggested edits button [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041) (owner: 10Kosta Harlan) [22:36:59] (03Merged) 10jenkins-bot: UserEditTracker: Allow querying primary DB for edit timestamp [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051 (owner: 10Gergő Tisza) [22:37:26] !log tgr@deploy1002 Started scap: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]] [22:37:31] T325041: Bring NewImpact logging on par with old Impact - https://phabricator.wikimedia.org/T325041 [22:39:14] !log tgr@deploy1002 tgr and kharlan and tgr: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:43:50] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10mpopov) I just updated @Fuzzy's permissions for he.m.wikisource. Used to be 'Restricted' but 'Full' now (same as he.wikisource). >... [22:46:22] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host alert1001.wikimedia.org [22:47:14] ^ it was expected that the last test fails as it connects to external services. [22:47:20] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:47:22] I'm already looking at the hosts health. [22:48:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:48:50] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.203 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:49:03] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]] (duration: 11m 37s) [22:49:07] T325041: Bring NewImpact logging on par with old Impact - https://phabricator.wikimedia.org/T325041 [22:53:17] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681) (owner: 10BryanDavis) [22:55:04] PROBLEM - Check systemd state on mw2374 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:20] !log doing the last backport by hand due to T325252 [22:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:24] T325252: scap backport fails with "Multiple changes found" - 
https://phabricator.wikimedia.org/T325252 [22:57:54] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:58:34] (03Merged) 10jenkins-bot: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681) (owner: 10BryanDavis) [22:58:48] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:41] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [23:00:48] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:01:02] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [23:03:10] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [23:03:17] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert2001.wikimedia.org [23:03:18] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2001.wikimedia.org [23:03:42] (03CR) 10Gergő Tisza: [C: 03+2] User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930) (owner: 10Gergő Tisza) [23:04:41] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [23:08:33] 
(KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [23:10:41] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on alert2001.wikimedia.org with reason: kernel update [23:10:42] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [23:10:42] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on alert2001.wikimedia.org with reason: kernel update [23:12:13] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [23:12:44] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:15] !log Toolhub: rebuilding search indices following app update [23:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:48] (03PS1) 10Bking: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) [23:17:35] (03PS1) 10Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) [23:18:54] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [23:20:24] (03CR) 10Herron: [C: 03+1] librenms: Increase the TTL for LibreNMS [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [23:20:51] (03Merged) 10jenkins-bot: User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930) 
(owner: 10Gergő Tisza) [23:21:52] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:22:04] PROBLEM - Host wdqs2009 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:41] (03PS2) 10Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) [23:24:22] RECOVERY - Host wdqs2009 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [23:24:40] (03PS4) 10Andrea Denisse: librenms: Increase the TTL for LibreNMS [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) [23:24:54] (03PS3) 10Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) [23:25:37] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [23:25:47] (03CR) 10Dzahn: "I don't think it hurts but in other cases we have also just kept it at 5M permanently in case we need to switch and afaict there is no dow" [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [23:27:31] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2011.* [23:27:37] (03CR) 10Herron: [C: 03+1] "LGTM 🕳" [puppet] - 10https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: 10Cwhite) [23:27:37] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2012.* [23:28:29] !log T301167 wdqs2011/2012 were not visible in pybal (oversight from when I added the other hosts with conftool last week). Fixed that, so now all of the new hosts are showing up properly. 
[23:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:33] T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167 [23:29:40] !log [WDQS] Downtimed wdqs20[09-12] for the next 7 days [23:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:44] (03CR) 10Andrea Denisse: [C: 03+2] librenms: Increase the TTL for LibreNMS [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [23:29:58] (03CR) 10Andrea Denisse: [C: 03+2] librenms: Increase the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [23:31:14] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:31:26] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:32:48] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:33:00] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:33:16] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2026'] [23:33:51] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/868199/38794/deploy2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/868199 
(https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [23:34:12] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect, ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:34:47] (03PS4) 10Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) [23:35:04] PROBLEM - OpenSearch health check for shards on 9200 on logstash2026 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8165543390: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [23:35:04] org/wiki/Search%23Administration [23:36:00] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: 10Cwhite) [23:40:57] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2026'] [23:41:30] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2026'] [23:44:46] PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100% [23:46:40] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:46:54] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:12] !log cwhite@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2026'] [23:48:36] RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [23:48:45] (JobUnavailable) firing: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:48:56] RECOVERY - OpenSearch health check for shards on 9200 on logstash2026 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1465, relocating_shards: 6, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [23:48:56] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:50:33] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye [23:53:06] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:53:45] (JobUnavailable) resolved: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:55:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:57:44] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:59:58] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status