[00:08:36] <wikibugs>	 (PS47) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[00:09:32] <wikibugs>	 (PS48) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[00:11:47] <wikibugs>	 SRE, Traffic-Icebox, WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Seksen) So based purely on Krinkle's comment above it does appear that this would likely be a problem with the parser cache.  Adding ?123 query string into th...
[00:12:04] <wikibugs>	 (CR) Raymond Ndibe: "ignore the last three patches. I was trying to fix a problem I introduced when I did git pull" [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[00:13:28] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (colewhite)
[00:17:59] <wikibugs>	 (CR) Raymond Ndibe: "thanks David for working on this. this will make testing easier than it currently is" [puppet] - https://gerrit.wikimedia.org/r/867566 (owner: David Caro)
[00:25:46] <wikibugs>	 SRE, Traffic-Icebox, WMF-General-or-Unknown, Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Krinkle) @Seksen When browsing with a login session, you do still enjoy the performance benefit of the ParserCache, this is appl...
[00:25:49] <wikibugs>	 SRE, Traffic-Icebox, WMF-General-or-Unknown, Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Krinkle)
[00:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:45:46] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2036.codfw.wmnet with OS bullseye
[00:46:59] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2037.codfw.wmnet with OS bullseye
[01:00:39] <wikibugs>	 (PS1) Cwhite: logstash: heavily restrict mediawiki http accesslog during initial onboarding [puppet] - https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439)
[01:01:02] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage
[01:02:24] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[01:04:08] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2036.codfw.wmnet with reason: host reimage
[01:05:12] <wikibugs>	 (PS1) Eevans: Migrate echostore & sessionstore staging to new cassandra-dev cluster [deployment-charts] - https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113)
[01:06:04] <wikibugs>	 (PS1) Cwhite: site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335)
[01:06:40] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2037.codfw.wmnet with reason: host reimage
[01:17:54] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:18:18] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2036.codfw.wmnet with OS bullseye
[01:22:01] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2037.codfw.wmnet with OS bullseye
[01:22:12] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:27:54] <wikibugs>	 (CR) RLazarus: slo_dashboards: dynamic slo dashboard panels (3 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[01:34:00] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:35:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:40:15] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 203 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:51:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:54:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 188 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:55:15] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:10:15] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:15] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:43:26] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:00:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:29:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:42:21] <wikibugs>	 SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Ladsgroup)
[05:54:45] <wikibugs>	 (PS1) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T312666)
[05:55:13] <wikibugs>	 (PS2) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662)
[05:55:50] <wikibugs>	 (PS3) Ladsgroup: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662)
[06:47:12] <wikibugs>	 (PS1) Marostegui: misc.my.cnf, production.my.cnf: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867909
[06:47:54] <wikibugs>	 (PS1) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689)
[06:48:16] <wikibugs>	 (PS1) Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689)
[06:48:26] <wikibugs>	 (CR) Marostegui: [C: +2] misc.my.cnf, production.my.cnf: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867909 (owner: Marostegui)
[06:50:17] <wikibugs>	 (PS1) Marostegui: mariadb: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867912
[06:50:58] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: innodb_change_buffering status [puppet] - https://gerrit.wikimedia.org/r/867912 (owner: Marostegui)
[07:12:27] <wikibugs>	 (CR) Hashar: [C: +2] wm-checks-api: show processor prototype name on error [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867648 (owner: Hashar)
[07:12:37] <wikibugs>	 (CR) Hashar: [C: +2] wm-checks-api: parse PipelineLib messages [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867656 (owner: Hashar)
[07:13:00] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: show processor prototype name on error [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867648 (owner: Hashar)
[07:13:06] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: parse PipelineLib messages [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/867656 (owner: Hashar)
[07:28:36] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@7ba179f]: (no justification provided)
[07:28:44] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@7ba179f]: (no justification provided) (duration: 00m 08s)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0800)
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:07:42] <wikibugs>	 (PS1) Hashar: scap: disable git safe.directory [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128)
[08:15:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: adapt releases to the changes upstream in puppet [deployment-charts] - https://gerrit.wikimedia.org/r/868005
[08:17:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "This will need for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868005 to be merged afterwards" [puppet] - https://gerrit.wikimedia.org/r/860102 (owner: Effie Mouzeli)
[08:17:45] <wikibugs>	 (CR) Hashar: "I tried to cherry pick it on the beta cluster Puppet master and it has the same issue since the repository is owned by "gitpuppet". I have" [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: Hashar)
[08:17:47] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: JHathaway)
[08:18:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: Isabelle Hurbain-Palatin)
[08:24:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:24:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:53] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068
[08:25:57] <stashbot>	 T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[08:26:04] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068 (duration: 00m 11s)
[08:29:38] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:30:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:33:53] <hashar>	 I am going to restart Gerrit for a plugin upgrade
[08:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:37:43] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@353573b]: HDFS usage dataset pipeline deployment without superuser TEST [airflow-dags@353573b]
[08:37:54] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@353573b]: HDFS usage dataset pipeline deployment without superuser TEST [airflow-dags@353573b] (duration: 00m 10s)
[08:39:16] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@353573b]: HDFS usage dataset pipeline deployment without superuser [airflow-dags@353573b]
[08:39:29] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@353573b]: HDFS usage dataset pipeline deployment without superuser [airflow-dags@353573b] (duration: 00m 13s)
[08:39:47] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068
[08:39:50] <stashbot>	 T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068
[08:39:56] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c0b0a70]: Add support for PipelineBot to the Checks API plugin - T214068 (duration: 00m 09s)
[08:41:59] <hashar>	 !log Restarted Gerrit for a plugin update
[08:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:08] <hashar>	 this time it stopped almost instantly
[08:52:22] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) a:elukey
[09:00:05] <jouncebot>	 hashar and ^demon: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0900).
[09:01:58] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[09:02:55] <wikibugs>	 (PS1) JMeybohm: calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191)
[09:02:57] <wikibugs>	 (PS1) JMeybohm: calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191)
[09:03:48] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:04:48] <wikibugs>	 (CR) Elukey: [C: +1] calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:05:06] <wikibugs>	 (CR) Elukey: [C: +1] calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:06:01] <hashar>	 oh, the train
[09:07:25] <hashar>	 I need to check the overnight logs first
[09:07:39] <hashar>	 I had some side work to do this morning
[09:11:08] <wikibugs>	 (CR) JMeybohm: [C: +2] calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:11:11] <wikibugs>	 (CR) JMeybohm: [C: +2] calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:13:50] <wikibugs>	 (PS1) Ladsgroup: search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621)
[09:13:57] <Amir1>	 jouncebot: nowandnext
[09:13:57] <jouncebot>	 For the next 1 hour(s) and 46 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T0900)
[09:13:57] <jouncebot>	 In 4 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400)
[09:14:15] <Amir1>	 hashar: I'll quickly backport something
[09:14:20] <wikibugs>	 (CR) Ladsgroup: [C: +2] search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621) (owner: Ladsgroup)
[09:15:50] <hashar>	 Amir1: please do ;)
[09:16:16] <hashar>	 I am digging into one of the errors I missed last night
[09:16:31] <wikibugs>	 (Merged) jenkins-bot: calico: Make ganeti worker nodes peer with core routers (aux) [deployment-charts] - https://gerrit.wikimedia.org/r/868029 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:16:33] <wikibugs>	 (Merged) jenkins-bot: calico: Make ganeti worker nodes peer with core routers [deployment-charts] - https://gerrit.wikimedia.org/r/868030 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:20:30] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:21:09] <hashar>	 I think I will block the train
[09:21:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6001.wikimedia.org
[09:21:19] <hashar>	 there is an error happening since yesterday and I don't know the impact
[09:21:36] <hashar>	 besides that, it refers to parsoid, which sounds scary
[09:23:08] <Amir1>	 that is possibly related to what Daniel is doing, I suggest bringing it up in restbase-sunset in slack
[09:23:18] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) (owner: JMeybohm)
[09:25:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:26:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet
[09:27:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6001.wikimedia.org
[09:27:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:27:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:27:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:28:06] <wikibugs>	 (Merged) jenkins-bot: search: Avoid setting height in search thumbnails [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868046 (https://phabricator.wikimedia.org/T322621) (owner: Ladsgroup)
[09:28:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:28:25] <hashar>	 Amir1: ah thank you, doing so
[09:28:36] <Amir1>	 I just did :D
[09:29:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5002.wikimedia.org
[09:30:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp2001.wikimedia.org
[09:31:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:34:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet
[09:35:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp2001.wikimedia.org
[09:37:25] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (Vgutierrez) >>! In T188561#8464256, @DBu-WMF wrote: > @Vgutierrez is there anything left to do so that we can move forward on this task?  P...
[09:37:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-corp1001.wikimedia.org
[09:39:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5002.wikimedia.org
[09:39:50] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1002.eqiad.wmnet
[09:40:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org
[09:41:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp1001.wikimedia.org
[09:45:05] <wikibugs>	 (CR) Jelto: "thanks for the detailed review! I uploaded a new patchset." [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:45:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1002.eqiad.wmnet
[09:46:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org
[09:46:45] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.14/includes/search/SearchResultThumbnailProvider.php: Backport: [[gerrit:868046|search: Avoid setting height in search thumbnails (T322621)]] (duration: 08m 07s)
[09:46:48] <stashbot>	 T322621: Use standard thumbsizes in modern vector search - https://phabricator.wikimedia.org/T322621
[09:47:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1003.eqiad.wmnet
[09:53:26] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[09:54:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1003.eqiad.wmnet
[09:54:42] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM 👍" [puppet] - https://gerrit.wikimedia.org/r/867594 (owner: Slyngshede)
[09:55:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2001.codfw.wmnet
[09:55:43] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=no; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[09:56:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[09:59:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2001.codfw.wmnet
[10:00:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[10:01:35] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:03:23] <wikibugs>	 (CR) Majavah: [C: -1] base::cloud_production: introduce new profile (5 comments) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:03:26] <wikibugs>	 (CR) Jaime Nuche: [C: +1] "LGTM, thanks again!" [puppet] - https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[10:04:18] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: Muehlenhoff)
[10:06:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3005.wikimedia.org
[10:07:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (aborrero) Resolved→Open a:Cmjohnson→cmooney Reopening until switch changes are made by @cmooney
[10:08:29] <wikibugs>	 (PS5) Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[10:08:49] <wikibugs>	 (CR) CI reject: [V: -1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: Aqu)
[10:10:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3005.wikimedia.org
[10:10:50] <wikibugs>	 (CR) David Caro: [C: +1] base::cloud_production: introduce new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:11:29] <wikibugs>	 (PS6) Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[10:11:48] <wikibugs>	 (CR) CI reject: [V: -1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: Aqu)
[10:12:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2002.codfw.wmnet
[10:14:52] <wikibugs>	 (PS1) Kosta Harlan: NewImpact: Add log event for clicking suggested edits button [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041)
[10:16:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2002.codfw.wmnet
[10:17:28] <wikibugs>	 (CR) Jbond: "adding brain who knows the environment better then i" [puppet] - https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[10:17:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet
[10:17:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[10:18:31] <wikibugs>	 (PS1) Jelto: gitlab_runner: add trusted tag to Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069)
[10:19:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging-etcd2003.codfw.wmnet
[10:20:48] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm merging thanks" [puppet] - https://gerrit.wikimedia.org/r/867579 (owner: Hashar)
[10:21:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet
[10:23:23] <wikibugs>	 (PS1) Jelto: gitlab_runner: add wmcs tag to Shared Runners [puppet] - https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069)
[10:23:41] <wikibugs>	 (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:24:18] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[10:24:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[10:25:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet
[10:25:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging-etcd2003.codfw.wmnet
[10:28:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet
[10:28:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet
[10:29:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6001.wikimedia.org
[10:30:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet
[10:30:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[10:31:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1002.eqiad.wmnet
[10:31:55] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=no; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[10:33:36] <wikibugs>	 (03PS1) 10Jcrespo: icinga: Make the punctuation error optional on check [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169)
[10:33:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6001.wikimedia.org
[10:34:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet
[10:34:57] <wikibugs>	 10SRE, 10Traffic: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) 05Open→03In progress After an initial check this seems to be an issue on Varnish, ATS sets `X-Cache-Int` to `miss`: ` vgutierrez@cp6003:~$ curl -H 'Host: upload.wiki...
[10:35:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1002.eqiad.wmnet
[10:35:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5001.wikimedia.org
[10:35:50] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1003.eqiad.wmnet
[10:36:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Thanks @Dzahn. We should probably add that final step to the [hardware troubleshooting runbook](https://wikitech.wikimedia.org/wiki/SRE/Dc-ope...
[10:37:07] <wikibugs>	 10SRE, 10Observability-Alerting, 10observability, 10Patch-For-Review: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (10jcrespo) >>! In T317169#8240929, @Dzahn wrote: > After pondering this a bit more I now think the _actual fix_ would be if Wikipedia and other projec...
[10:39:43] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1003.eqiad.wmnet
[10:40:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5001.wikimedia.org
[10:40:11] <wikibugs>	 (03CR) 10Jcrespo: "This should fix https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=en.wikibooks.org&service=Ensure+legal+html+en.wb" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[10:41:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4001.wikimedia.org
[10:43:35] <wikibugs>	 (03PS24) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596)
[10:43:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Added documentation to avoid forgetting this step, DC-Ops feel free to revert or ask me to move it elsewhere if you feel it shouldn't be there.
[10:43:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez)
[10:46:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4001.wikimedia.org
[10:49:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3001.wikimedia.org
[10:50:45] <wikibugs>	 (03CR) 10Volans: "replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:52:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet
[10:54:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3001.wikimedia.org
[10:54:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org
[10:54:33] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet
[10:58:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet
[10:58:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2003.wikimedia.org
[10:59:22] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] base::cloud_production: introduce new profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:59:44] <wikibugs>	 (03PS4) 10Volans: cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401)
[10:59:48] <wikibugs>	 (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:00:12] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet
[11:00:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1003.wikimedia.org
[11:00:40] <moritzm>	 !log installing dpkg bugfix updates from Bullseye point release
[11:00:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:33] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet
[11:04:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1003.wikimedia.org
[11:09:06] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet
[11:09:15] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet
[11:10:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet
[11:10:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn)
[11:11:04] <wikibugs>	 (03PS25) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596)
[11:11:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:17:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet
[11:18:28] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Reproduce T324956 in a VTC test [puppet] - 10https://gerrit.wikimedia.org/r/868043 (https://phabricator.wikimedia.org/T324956)
[11:18:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet
[11:19:11] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ml-cache2002.codfw.wmnet
[11:20:00] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is CRITICAL: connect to address 10.192.16.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:20:00] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[11:20:02] <icinga-wm>	 PROBLEM - cassandra-a service on ml-cache2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:20:02] <icinga-wm>	 PROBLEM - Check systemd state on ml-cache2002 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:18] <icinga-wm>	 RECOVERY - Check systemd state on ml-cache2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:28] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is OK: SSL OK - Certificate ml-cache2002-a valid until 2024-06-15 08:50:24 +0000 (expires in 548 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[11:22:30] <icinga-wm>	 RECOVERY - cassandra-a service on ml-cache2002 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:26:03] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet
[11:26:30] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is OK: TCP OK - 0.032 second response time on 10.192.16.190 port 9042 https://phabricator.wikimedia.org/T93886
[11:26:55] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet
[11:29:56] <wikibugs>	 (03PS26) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596)
[11:30:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez)
[11:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:32:10] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) 05In progress→03Stalled A [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/868043/1/modules/varnish/files/tests/text/33-x-cache-status.v...
[11:32:30] <wikibugs>	 (03PS1) 10David Caro: Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436)
[11:33:25] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:34:51] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet
[11:37:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436) (owner: 10David Caro)
[11:38:19] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet
[11:38:25] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:42:12] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet
[11:42:24] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet
[11:46:11] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet
[11:49:57] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[11:51:56] <wikibugs>	 (03PS1) 10Marostegui: parsercache.my.cnf.erb: innodb_change_buffering status [puppet] - 10https://gerrit.wikimedia.org/r/868069
[11:52:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10cmooney) Ok the ports are reconfigured now if you want to give it another shot @Andrew
[11:52:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf.erb: innodb_change_buffering status [puppet] - 10https://gerrit.wikimedia.org/r/868069 (owner: 10Marostegui)
[11:54:46] <wikibugs>	 10SRE, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10STran)
[11:55:28] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet
[11:55:40] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38763/console" [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez)
[11:57:27] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "looks good, awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez)
[11:58:09] <wikibugs>	 10SRE, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10STran) I'm aware this is a duplicate of {T290917} but I made it anyway because: - afaik, the scope of security-api has changed (for now). Whatever's being implemented is for IPInfo's spec...
[11:58:47] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2:pooled=yes; selector: service=thumbor,name=kubernetes101[123].eqiad.wmnet
[11:59:17] <wikibugs>	 (03PS1) 10Majavah: openstack::haproxy::site: don't provision backend FW rules [puppet] - 10https://gerrit.wikimedia.org/r/868070
[11:59:20] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet
[11:59:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm some minor comments questions inline but nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli)
[12:00:33] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] cloudlb: introduce role skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez)
[12:00:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:07:24] <Amir1>	 jouncebot: nowandnext
[12:07:24] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 52 minute(s)
[12:07:24] <jouncebot>	 In 1 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400)
[12:07:33] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup)
[12:07:58] <wikibugs>	 (03PS1) 10Majavah: labstore: nfs-mounts: add dumps for qrank [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952)
[12:08:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup)
[12:08:40] <wikibugs>	 (03Merged) 10jenkins-bot: Externallinks: Set Persian Wikiquote to WRITE BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867740 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup)
[12:09:13] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[12:09:38] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38765/console" [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952) (owner: 10Majavah)
[12:09:40] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]]
[12:09:44] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[12:11:22] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10hnowlan)
[12:11:25] <jbond>	 !log disable puppet fleet wide to perform server reboots
[12:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[12:14:05] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet
[12:14:23] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[12:18:14] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1001.eqiad.wmnet
[12:19:49] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet
[12:20:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore: nfs-mounts: add dumps for qrank [puppet] - 10https://gerrit.wikimedia.org/r/868071 (https://phabricator.wikimedia.org/T324952) (owner: 10Majavah)
[12:20:37] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet
[12:23:07] <wikibugs>	 (03CR) 10Marostegui: "I would prefer something a bit more meaningful than b1, my first reaction was related to PDUs/racks :)" [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[12:25:42] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1002.eqiad.wmnet
[12:26:51] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10jbond) p:05Triage→03Medium
[12:26:55] <wikibugs>	 (03PS1) 10Volans: config: allow to spcify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074
[12:26:57] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196)
[12:27:04] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host pki2001.codfw.wmnet
[12:27:40] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Ladsgroup) This is not really user-impacting, specially given that mw-on-k8s is on test2wiki only but I think it should show up in next week's Tech news regardle...
[12:28:53] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[12:28:57] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[12:29:48] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1001.eqiad.wmnet
[12:30:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:30:29] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet
[12:35:04] <wikibugs>	 (03PS7) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[12:35:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu)
[12:36:05] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[12:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:37:06] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:38:03] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[12:38:58] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867740|Externallinks: Set Persian Wikiquote to WRITE BOTH (T321662)]] (duration: 29m 18s)
[12:39:02] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[12:41:41] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: increase cpu limit to 1.5 per instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/868075 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:42:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[12:44:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[12:47:23] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[12:51:44] <wikibugs>	 (03PS2) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[12:53:53] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10akosiaris) >>! In T290536#8466377, @Ladsgroup wrote: >  - Inviting tech users to test our the new infra and let us know of issues early on.  A related note. 2 th...
[12:54:01] <wikibugs>	 (03PS8) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[12:54:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu)
[12:55:44] <wikibugs>	 (03Abandoned) 10David Caro: Allow overriding the cookbooks module name [software/spicerack] - 10https://gerrit.wikimedia.org/r/868067 (https://phabricator.wikimedia.org/T319436) (owner: 10David Caro)
[12:55:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T325154', diff saved to https://phabricator.wikimedia.org/P42692 and previous config saved to /var/cache/conftool/dbconfig/20221214-125544-marostegui.json
[12:55:49] <stashbot>	 T325154: Clean up unix_socket flag in my.cnf - https://phabricator.wikimedia.org/T325154
[12:56:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb1003.eqiad.wmnet
[12:56:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Don't install quickstack on Bookworm, revisit later [puppet] - 10https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783)
[12:58:10] <wikibugs>	 (03PS2) 10Muehlenhoff: Don't install quickstack on Bookworm, revisit later [puppet] - 10https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783)
[12:59:07] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[12:59:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P42693 and previous config saved to /var/cache/conftool/dbconfig/20221214-125928-root.json
[12:59:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 T325154', diff saved to https://phabricator.wikimedia.org/P42694 and previous config saved to /var/cache/conftool/dbconfig/20221214-125950-marostegui.json
[13:01:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 5%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42695 and previous config saved to /var/cache/conftool/dbconfig/20221214-130119-root.json
[13:02:11] <wikibugs>	 (03PS1) 10Marostegui: db_inventory.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868079 (https://phabricator.wikimedia.org/T325154)
[13:06:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db_inventory.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868079 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui)
[13:08:34] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196)
[13:09:28] <wikibugs>	 (03PS3) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[13:14:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "I suppose you'll test raising the number of replicas separately?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[13:14:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P42696 and previous config saved to /var/cache/conftool/dbconfig/20221214-131433-root.json
[13:16:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42697 and previous config saved to /var/cache/conftool/dbconfig/20221214-131624-root.json
[13:17:57] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38766/console" [puppet] - 10https://gerrit.wikimedia.org/r/868070 (owner: 10Majavah)
[13:27:41] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "This works for me 👍" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (owner: 10Volans)
[13:27:46] <wikibugs>	 (03PS1) 10Marostegui: sanitarium_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868083 (https://phabricator.wikimedia.org/T325154)
[13:29:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P42699 and previous config saved to /var/cache/conftool/dbconfig/20221214-132938-root.json
[13:29:41] <wikibugs>	 (03PS1) 10Ladsgroup: Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137)
[13:29:52] <Amir1>	 jouncebot: nowandnext
[13:29:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 30 minute(s)
[13:29:52] <jouncebot>	 In 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400)
[13:31:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42700 and previous config saved to /var/cache/conftool/dbconfig/20221214-133129-root.json
[13:32:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:35:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Adding ihurbain to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/867657 (https://phabricator.wikimedia.org/T325080) (owner: 10Isabelle Hurbain-Palatin)
[13:37:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add ihurbain to parsoid-test-roots - https://phabricator.wikimedia.org/T325080 (10akosiaris) 05Open→03Resolved I guess all that's left for me as a clinic duty person is to merge the change and resolve the task. Done and done. Thanks everyone! @ihurbain pl...
[13:38:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 04-1] Example strategy for marking DSCP with ferm and puppet integration (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[13:42:27] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10akosiaris) @Fuzzy, did the updated permissions work out ok? Can we resolve this task?
[13:44:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P42701 and previous config saved to /var/cache/conftool/dbconfig/20221214-134443-root.json
[13:46:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42702 and previous config saved to /var/cache/conftool/dbconfig/20221214-134634-root.json
[13:55:21] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup)
[13:59:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P42703 and previous config saved to /var/cache/conftool/dbconfig/20221214-135948-root.json
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1400).
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:10] <Lucas_WMDE>	 o/
[14:00:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] sanitarium_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868083 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui)
[14:00:20] <Lucas_WMDE>	 yup, looks like nothing to do ^^
[14:01:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42704 and previous config saved to /var/cache/conftool/dbconfig/20221214-140139-root.json
[14:03:51] <wikibugs>	 (03PS1) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085
[14:04:26] <wikibugs>	 (03PS2) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainain [alerts] - 10https://gerrit.wikimedia.org/r/868085
[14:04:31] <wikibugs>	 (03PS3) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085
[14:06:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup)
[14:06:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] team-data-persistence: Stop alerting on dbs the team doesn't maintain [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup)
[14:08:15] <wikibugs>	 (03PS4) 10Ladsgroup: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085
[14:09:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup)
[14:09:55] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri)
[14:10:01] <wikibugs>	 (03PS2) 10Volans: config: allow to specify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168)
[14:10:09] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10DBu-WMF) @greg can we move forward and turn click-tracking back on in Acoustic?
[14:10:44] <wikibugs>	 (03Merged) 10jenkins-bot: Parsoid: Default parsoid version to "0.0.0" for unsupported models [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup)
[14:10:54] <wikibugs>	 (03Merged) 10jenkins-bot: team-data-persistence: Stop alerting on dbs the team doesn't mainaint [alerts] - 10https://gerrit.wikimedia.org/r/868085 (owner: 10Ladsgroup)
[14:11:02] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri)
[14:11:18] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri)
[14:11:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868048 (https://phabricator.wikimedia.org/T325137) (owner: 10Ladsgroup)
[14:11:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs1003.eqiad.wmnet
[14:11:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs1003.eqiad.wmnet
[14:11:52] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]]
[14:11:55] <stashbot>	 T325137: UnexpectedValueException: Invalid version string "" - https://phabricator.wikimedia.org/T325137
[14:13:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10bking) Removed downtime and repooled WCQS as it sounds like reseating the hard drives may have fixed it. @Jclark-ctr let us know if you hear anythi...
[14:13:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet
[14:13:43] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[14:14:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P42705 and previous config saved to /var/cache/conftool/dbconfig/20221214-141453-root.json
[14:15:27] <wikibugs>	 (03PS1) 10Volans: cookbooks: remote top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168)
[14:15:28] <wikibugs>	 (03PS1) 10Volans: cookboos.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088
[14:15:45] <wikibugs>	 (03PS2) 10Volans: cookbooks: remote top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168)
[14:16:41] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri)
[14:16:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After testing unix_socket plugin', diff saved to https://phabricator.wikimedia.org/P42706 and previous config saved to /var/cache/conftool/dbconfig/20221214-141644-root.json
[14:17:02] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "-1 for now, depends on the Spicerack release with the related patch" [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[14:17:07] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) 05Open→03In progress p:05Triage→03Medium
[14:17:34] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (10fnegri) a:03Volans
[14:17:44] <wikibugs>	 (03PS9) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[14:19:32] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38767/console" [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche)
[14:20:03] <wikibugs>	 (03PS10) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[14:20:04] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:868048|Parsoid: Default parsoid version to "0.0.0" for unsupported models (T325137)]] (duration: 08m 12s)
[14:20:08] <stashbot>	 T325137: UnexpectedValueException: Invalid version string "" - https://phabricator.wikimedia.org/T325137
[14:20:26] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche)
[14:21:52] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] docker_registry_ha: add contint2002 to image builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/867708 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[14:22:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kserve-inference: fix dependencies in Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/867600 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey)
[14:22:58] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche)
[14:24:10] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] logstash: heavily restrict mediawiki http accesslog during initial onboarding [puppet] - 10https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: 10Cwhite)
[14:28:45] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mwdebug_deploy: remove resources from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/867217 (owner: 10Jaime Nuche)
[14:28:47] <wikibugs>	 (03PS3) 10FNegri: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[14:29:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P42707 and previous config saved to /var/cache/conftool/dbconfig/20221214-142958-root.json
[14:30:47] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb1003.eqiad.wmnet
[14:38:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (10Jclark-ctr) 05Open→03Resolved
[14:44:10] <wikibugs>	 (03PS11) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[14:44:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580) (owner: 10Aqu)
[14:44:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10SCherukuwada) Apologies, I don't have access to wikisource. @mpopov does probably.
[14:44:52] <wikibugs>	 (03PS4) 10Volans: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168)
[14:44:54] <wikibugs>	 (03PS2) 10Volans: cookboos.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088
[14:45:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[14:46:07] <wikibugs>	 (03PS12) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[14:47:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host xhgui2001.codfw.wmnet
[14:50:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui2001.codfw.wmnet
[14:51:54] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:52:04] <wikibugs>	 (03PS13) 10Aqu: Use Airflow 2.4.3 + Postgres in test-cluster [puppet] - 10https://gerrit.wikimedia.org/r/867668 (https://phabricator.wikimedia.org/T315580)
[14:52:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: increase cpu limit, reduce workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:54:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10decommission-hardware, 10serviceops-collab: decommission contint1001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T325102 (10Jclark-ctr)
[14:54:40] <wikibugs>	 10SRE, 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10decommission-hardware, 10serviceops-collab: decommission contint1001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T325102 (10Jclark-ctr) 05Open→03Resolved
[14:54:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host xhgui1001.eqiad.wmnet
[14:55:10] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191)
[14:55:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1005.eqiad.wmnet
[14:55:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Add Sondes Ben Chagra to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: 10JHathaway)
[14:55:58] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add Sondes Ben Chagra to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/867706 (https://phabricator.wikimedia.org/T324696) (owner: 10JHathaway)
[14:56:27] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: increase cpu limit, reduce workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/868081 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[14:57:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey)
[14:57:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1004.eqiad.wmnet
[14:57:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[14:58:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s: Keep deprecated failure-domain.beta.* labels around in 1.23 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867589 (https://phabricator.wikimedia.org/T270191) (owner: 10JMeybohm)
[14:58:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Planning, 10LDAP-Access-Requests, and 2 others: Grant Access to 'wmf' LDAP group for 'Sbenchagra' - https://phabricator.wikimedia.org/T324696 (10akosiaris) 05Open→03Resolved a:03akosiaris Thanks @jhathaway. user has been added to the WMF group. Resolvi...
[14:58:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui1001.eqiad.wmnet
[14:59:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) With {T270191} I've changed the zone of k8s ganeti workers to their respective ganeti cluster and g...
[15:00:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:00:10] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:00:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10akosiaris) Hello @sbassett, I see we are still missing some input here, any updates?
[15:00:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1006.eqiad.wmnet
[15:00:53] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1005.eqiad.wmnet
[15:01:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:01:52] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:03:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10akosiaris) 05Stalled→03Invalid Since there are no updates on this task and it pretty much appears to be a duplicate of T324057, I 'll resolve as `invalid` (not mergin...
[15:06:01] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) 05Open→03In progress
[15:07:15] <wikibugs>	 (03PS1) 10FNegri: Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092
[15:07:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[15:07:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1006.eqiad.wmnet
[15:08:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1007.eqiad.wmnet
[15:09:02] <wikibugs>	 (03CR) 10FNegri: [C: 04-2] "DO NOT MERGE. Will be pushed to the new Git repo." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri)
[15:10:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri)
[15:13:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10akosiaris) Hi @VirginiaPoundstone, which username do you use to log in to turnilo?
[15:14:00] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) I've updated wikitech https://wikitech.wikimedia.org/w/index.php?title=Caching_overview&diff=2040756&oldid=2029875 to reflect that both X-Cache `hi...
[15:14:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2001.codfw.wmnet
[15:15:13] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED
[15:15:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[15:16:24] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1007.eqiad.wmnet
[15:17:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1008.eqiad.wmnet
[15:17:21] <wikibugs>	 10SRE, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10LSobanski)
[15:17:32] <wikibugs>	 10SRE, 10serviceops, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10LSobanski)
[15:18:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED
[15:18:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2001.codfw.wmnet
[15:19:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet
[15:20:04] <wikibugs>	 10SRE, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Platform Engineering (Needs Cleaning - Cassandra Operational): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329 (10LSobanski)
[15:20:07] <wikibugs>	 10SRE, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review: Automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355 (10LSobanski)
[15:20:40] <wikibugs>	 10SRE, 10Patch-For-Review, 10Platform Engineering (Icebox): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10LSobanski) 05Open→03Resolved a:03LSobanski I talked to Eric, this is no longer relevant.
[15:21:59] <wikibugs>	 10SRE, 10serviceops, 10Maps (Maps-data): Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10jijiki) 05Open→03Resolved a:03jijiki Given that this task was opened when the infra was completely different, I am bluntly closing this task. I am happy to re-open if/w...
[15:23:03] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin: Add mnz to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/868098 (https://phabricator.wikimedia.org/T325072)
[15:23:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1008.eqiad.wmnet
[15:24:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet
[15:24:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1009.eqiad.wmnet
[15:24:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2003.codfw.wmnet
[15:25:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[15:26:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10akosiaris) @Wangombe could you please respond to the above comment? Many thanks!
[15:26:49] <wikibugs>	 10SRE, 10Cassandra, 10Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10Eevans)
[15:27:01] <wikibugs>	 10SRE, 10WMF-General-or-Unknown, 10Sustainability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705 (10LSobanski) 05Open→03Declined I'm closing this as Declined. Given its age and the changes in Restbase it likely needs a new problem statement befor...
[15:28:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2003.codfw.wmnet
[15:30:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We really need to find a way to generate the egress list programmatically. Thanks for this fix though!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:30:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1009.eqiad.wmnet
[15:30:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash1037.mgmt.eqiad.wmnet with reboot policy FORCED
[15:32:08] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash1036.mgmt.eqiad.wmnet with reboot policy FORCED
[15:32:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org
[15:33:26] <wikibugs>	 10SRE: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (10LSobanski) 05Open→03Resolved a:03LSobanski Considering the age of this task, we're probably safe to close it. Please reopen if you think otherwise.
[15:34:32] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[15:34:44] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:36:43] <wikibugs>	 (03PS1) 10Cmjohnson: Adding logstash1036-37 to site.pp and netboot cfg [puppet] - 10https://gerrit.wikimedia.org/r/868107 (https://phabricator.wikimedia.org/T313849)
[15:38:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org
[15:39:27] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding logstash1036-37 to site.pp and netboot cfg [puppet] - 10https://gerrit.wikimedia.org/r/868107 (https://phabricator.wikimedia.org/T313849) (owner: 10Cmjohnson)
[15:40:37] <wikibugs>	 (03PS1) 10Clément Goubert: wmnet: Add aux-k8s-ingress.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178)
[15:41:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) a:05cmooney→03Andrew
[15:41:54] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Regarding DNS updates, I am going to paste the comment I linked to in my last comment below so all of the information is in t...
[15:43:56] <icinga-wm>	 PROBLEM - Check systemd state on apt2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:18] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] Migrate echostore & sessionstore staging to new cassandra-dev cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[15:44:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[15:44:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Looks good so far. I don't really like that it's called flink-*kubernetes*-operator because that's very obvious at this point, but probabl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:45:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:45:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:46:55] <wikibugs>	 (03PS5) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644
[15:47:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:48:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:48:23] <wikibugs>	 (03PS2) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625
[15:48:40] <wikibugs>	 (03CR) 10Andrew Bogott: puppetmasters: cache cleanup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott)
[15:48:56] <wikibugs>	 (03CR) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott)
[15:49:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:49:28] <icinga-wm>	 RECOVERY - Check systemd state on apt2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:33] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett)
[15:50:06] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate echostore & sessionstore staging to new cassandra-dev cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/867733 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[15:50:12] <wikibugs>	 (03CR) 10Volans: [C: 03+2] config: allow to specify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:50:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:50:49] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply
[15:51:13] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[15:51:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett) >>! In T323943#8466964, @akosiaris wrote: > Hello @sbassett, I see we are still missing some input here, any upda...
[15:52:40] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[15:53:43] <wikibugs>	 (03Merged) 10jenkins-bot: config: allow to specify multiple cookbooks paths [software/spicerack] - 10https://gerrit.wikimedia.org/r/868074 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:54:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[15:54:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) p:05Triage→03Medium
[15:55:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) As an update, we're hopeful of having the work for T299125 done by end of this quarter (with deployment early next now, given how close we are to the no-change window); that...
[15:55:38] <wikibugs>	 (03CR) 10Marostegui: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:58:28] <wikibugs>	 (03PS1) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178)
[15:58:42] <wikibugs>	 (03CR) 10Jbond: "thanks, and ping when ever for help with the puppet stuff" [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[15:59:04] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "looks good, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert)
[16:00:39] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:00:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[16:02:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS buster
[16:02:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster
[16:02:23] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:03:24] <wikibugs>	 (03Abandoned) 10FNegri: Remove non-wmcs files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868092 (owner: 10FNegri)
[16:04:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott)
[16:05:41] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] Added some comments about where/how cloud hiera settings are applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott)
[16:06:36] <wikibugs>	 (03PS1) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561)
[16:06:52] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[16:10:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin::cloud_master: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867551 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:10:32] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1002.eqiad.wmnet
[16:11:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff)
[16:15:34] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1002.eqiad.wmnet
[16:16:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 1.63M messages - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[16:16:34] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:17:02] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[16:17:57] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[16:19:20] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon2002.codfw.wmnet
[16:19:59] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) I've created a CR https://gerrit.wikimedia.org/r/c/operations/dns/+/868103 to add the DNS records to the wikimedia.org DNS zone...
[16:21:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 464k messages - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[16:21:34] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:22:21] <hashar>	 I will promote group 1 wikis to 1.40.0-wmf.14 in a few minutes (at 16:30 UTC)
[16:22:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[16:23:32] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:23:52] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[16:24:15] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2002.codfw.wmnet
[16:25:15] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1001.eqiad.wmnet
[16:25:56] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[16:27:25] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2028']
[16:28:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash2028 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc9d1c82278: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[16:28:22] <icinga-wm>	 org/wiki/Search%23Administration
[16:29:26] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:30:04] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:31:34] <jinxer-wm>	 (KeyholderUnarmed) resolved: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:31:54] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:32:03] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:33:08] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:33:40] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1001.eqiad.wmnet
[16:34:15] <hashar>	 I am promoting group 1 wikis now
[16:34:19] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519)
[16:34:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot)
[16:35:09] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868104 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot)
[16:35:24] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2028']
[16:36:52] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host dispatch-be1001.eqiad.wmnet
[16:36:54] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2028']
[16:38:37] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:40:24] <icinga-wm>	 PROBLEM - Host logstash2028 is DOWN: PING CRITICAL - Packet loss = 100%
[16:40:35] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dispatch-be1001.eqiad.wmnet
[16:41:56] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet
[16:41:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet
[16:42:35] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri)
[16:42:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet
[16:43:12] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.14  refs T320519
[16:43:13] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host orespoolcounter1003.eqiad.wmnet
[16:43:15] <stashbot>	 T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[16:43:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet
[16:43:43] <wikibugs>	 (03CR) 10Marostegui: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:43:56] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash2028']
[16:44:06] <icinga-wm>	 RECOVERY - Host logstash2028 is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms
[16:44:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10fnegri) 05Open→03Resolved
[16:44:16] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri)
[16:44:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10fnegri) 05Open→03Resolved
[16:44:32] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri)
[16:44:58] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash2028 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1464, relocating_shards: 4, initializing_shards: 0, unassigned_shards: 1, delayed_unassigned_sha
[16:44:58] <icinga-wm>	 number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.93174061433447 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:45:59] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:46:15] <wikibugs>	 (03PS1) 10Eevans: sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113)
[16:46:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10nskaggs) As someone without global root who has been a test case in the past for this, allowing wmcs* cookbook runs for a subset of user...
[16:46:27] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2009.codfw.wmnet
[16:47:16] <wikibugs>	 (03Abandoned) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[16:47:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1003.eqiad.wmnet
[16:47:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[16:47:59] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri)
[16:48:09] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet
[16:48:15] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2028.codfw.wmnet with OS bullseye
[16:48:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet
[16:48:48] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet
[16:49:19] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[16:49:23] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10hnowlan) 05Open→03Resolved
[16:49:41] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet
[16:50:19] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.14  refs T320519 (duration: 07m 06s)
[16:50:22] <stashbot>	 T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[16:50:54] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:51:29] <jayme>	 hnowlan: ^^
[16:51:29] <wikibugs>	 (03PS1) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127
[16:51:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (owner: 10Daniel Kinzler)
[16:52:20] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1004.eqiad.wmnet
[16:52:22] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2009.codfw.wmnet
[16:52:30] <hnowlan>	 jayme: ack, thanks
[16:52:34] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[16:53:23] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet
[16:55:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet
[16:55:59] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1004.eqiad.wmnet
[16:56:38] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet
[16:57:55] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus4001.ulsfo.wmnet
[16:58:05] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[16:58:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1036.eqiad.wmnet with OS buster
[16:58:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with e...
[16:58:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet
[16:59:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2003.codfw.wmnet
[17:01:57] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2029']
[17:02:25] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:03:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2003.codfw.wmnet
[17:03:53] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4001.ulsfo.wmnet
[17:04:49] <wikibugs>	 (03CR) 10JHathaway: [V: 03+1] "looks good to me, though I would like someone on traffic to weigh in, or someone with more familiarity on our ingress setup." [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert)
[17:05:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet
[17:05:37] <icinga-wm>	 RECOVERY - Checks that the airflow database for airflow research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:06:03] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wmnet: Add aux-k8s-ingress.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/868100 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert)
[17:06:38] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2004.codfw.wmnet
[17:06:44] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet
[17:07:27] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2008.codfw.wmnet
[17:07:41] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet
[17:08:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[17:08:46] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2028.codfw.wmnet with reason: host reimage
[17:08:48] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus5001.eqsin.wmnet
[17:08:48] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2029']
[17:08:58] <mutante>	 !log planet2002 - rebooting 
[17:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:42] <mutante>	 !log planet1002 - rebooting 
[17:09:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2004.codfw.wmnet
[17:10:49] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add aux-k8s-ingress VIP - cgoubert@cumin1001"
[17:11:43] <mutante>	 !log doc2001 - rebooting
[17:11:43] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2029']
[17:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:46] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2028.codfw.wmnet with reason: host reimage
[17:11:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add aux-k8s-ingress VIP - cgoubert@cumin1001"
[17:11:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:12:58] <mutante>	 !log https://doc.wikimedia.org - maybe a few seconds of downtime
[17:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet
[17:13:42] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes1011.eqiad.wmnet
[17:13:55] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet
[17:14:09] <icinga-wm>	 PROBLEM - Host logstash2029 is DOWN: PING CRITICAL - Packet loss = 100%
[17:15:02] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10Volans) Indeed, I agree that we might need later on some more fine-tuned way to authorize things. That said the new cloudcumin setup wil...
[17:15:13] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5001.eqsin.wmnet
[17:16:26] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet
[17:17:02] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[17:18:30] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2029']
[17:18:43] <icinga-wm>	 RECOVERY - Host logstash2029 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms
[17:18:56] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus6001.drmrs.wmnet
[17:19:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=kubernetes1011.eqiad.wmnet
[17:19:01] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2029.codfw.wmnet with OS bullseye
[17:21:14] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet
[17:21:32] <wikibugs>	 (03Merged) 10jenkins-bot: sessionstore: Update egress rules for staging database [deployment-charts] - 10https://gerrit.wikimedia.org/r/868126 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[17:22:00] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet
[17:22:13] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[17:22:39] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[17:24:19] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:25:07] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6001.drmrs.wmnet
[17:27:29] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet
[17:28:16] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet
[17:33:47] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet
[17:34:45] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2028.codfw.wmnet with OS bullseye
[17:38:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott)
[17:38:16] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2029.codfw.wmnet with reason: host reimage
[17:41:25] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2029.codfw.wmnet with reason: host reimage
[17:41:56] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/868139
[17:42:13] <wikibugs>	 (03CR) 10Hokwelum: [C: 03+1] "Ariel and I looked at the dumps related file and from the PCC run, it looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli)
[17:44:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] scap: disable git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar)
[17:45:07] <effie>	 !log disable puppet on all P:mediawiki::nutcracker hosts (killing nutcracker on mw)
[17:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:43] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/868139 (owner: 10Volans)
[17:48:24] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "the code to manage the config file should be in the same place where the package is installed, imo" [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[17:48:59] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183)
[17:49:11] <wikibugs>	 (03PS6) 10Effie Mouzeli: mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183)
[17:49:13] <wikibugs>	 (03CR) 10David Caro: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[17:49:46] <claime>	 effie: \o/
[17:49:50] <claime>	 kill kill kill
[17:49:54] <effie>	 hehe
[17:49:55] <wikibugs>	 (03PS1) 10Volans: Upstream release v6.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/868142
[17:50:41] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[17:51:43] <thcipriani>	 jouncebot: now
[17:51:43] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 8 minute(s)
[17:52:09] <wikibugs>	 (03CR) 10David Caro: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[17:52:19] <thcipriani>	 duesen: which group are you targeting with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/868136 ?
[17:52:22] <duesen>	 thcipriani: o/
[17:52:37] <thcipriani>	 current lay of the land: https://versions.toolforge.org/
[17:52:39] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Goodbye Nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli)
[17:53:47] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[17:54:34] <mutante>	 effie: congratulations on removing nutcracker, that's been around forever
[17:54:44] <mutante>	 must be a milestone as well
[17:54:57] <thcipriani>	 oh! nice, kudos :)
[17:56:08] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+2] "for deploy" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[17:56:40] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v6.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/868142 (owner: 10Volans)
[17:56:51] <duesen>	 thcipriani: it'll be 30 minutes until this merges I suppose --^^
[17:57:09] <duesen>	 I guess we won't make it in time
[17:57:24] <thcipriani>	 duesen: bah, yeah, I guess that's true. Let's meet back here after and get it out?
[17:58:10] <duesen>	 I'm meeting with Subbu after... but I can verify on the side :)
[17:58:21] <duesen>	 It's not super urgent. It would just be nice to see that it works.
[17:58:40] <duesen>	 I guess we should deploy it when it's merged into the branch...
[18:00:25] <thcipriani>	 works for me. I'll keep an eye on it and ping you.
[18:00:36] <duesen>	 cool, thanks
[18:01:09] <volans>	 !log uploaded spicerack_6.0.0 to apt.wikimedia.org bullseye-wikimedia
[18:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:10] <thcipriani>	 duesen: well. Looks like a lot of test failures :\
[18:03:51] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2029.codfw.wmnet with OS bullseye
[18:04:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[18:06:00] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2027']
[18:06:22] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Hi all,  I apologize for the latency, I will be working on this today.  Thanks teammates,  Kelton Hurd Wikimedia...
[18:07:03] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet
[18:08:18] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash2027 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0217406278: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[18:08:18] <icinga-wm>	 org/wiki/Search%23Administration
[18:08:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[18:09:53] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2001.codfw.wmnet with OS bullseye
[18:11:30] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash2027 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 9, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1456, relocating_shards: 20, initializing_shards: 0, unassigned_shards: 9, delayed_unassigned_sha
[18:11:30] <icinga-wm>	 number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.38566552901024 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:12:46] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2027']
[18:14:22] <icinga-wm>	 PROBLEM - nutcracker process on mw1447 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:14:32] <icinga-wm>	 PROBLEM - nutcracker socket on parse2002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:14:34] <effie>	 please ignore the nutcracker alerts
[18:14:36] <icinga-wm>	 PROBLEM - nutcracker socket on parse2001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:14:42] <icinga-wm>	 PROBLEM - nutcracker socket on mwdebug1002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:14:42] <effie>	 they will be cleared soon 
[18:14:46] <icinga-wm>	 PROBLEM - nutcracker socket on mw1447 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:02] <icinga-wm>	 PROBLEM - nutcracker socket on mw1448 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:02] <icinga-wm>	 PROBLEM - nutcracker process on parse1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:04] <icinga-wm>	 PROBLEM - nutcracker process on mw1449 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:06] <icinga-wm>	 PROBLEM - nutcracker process on mwdebug1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:08] <icinga-wm>	 PROBLEM - nutcracker process on mw2271 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:08] <icinga-wm>	 PROBLEM - nutcracker process on mw1448 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:10] <icinga-wm>	 PROBLEM - nutcracker socket on mw2272 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:14] <icinga-wm>	 PROBLEM - nutcracker process on parse1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:14] <icinga-wm>	 PROBLEM - nutcracker socket on mw2374 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:16] <icinga-wm>	 PROBLEM - nutcracker socket on mw2271 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:16] <icinga-wm>	 PROBLEM - nutcracker socket on parse1001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_eqiad.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:16] <icinga-wm>	 PROBLEM - nutcracker process on parse2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:20] <icinga-wm>	 PROBLEM - nutcracker socket on mwdebug2002 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:28] <icinga-wm>	 PROBLEM - nutcracker socket on mw2376 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:15:28] <icinga-wm>	 PROBLEM - nutcracker socket on mwdebug2001 is CRITICAL: connect to file socket /var/run/nutcracker/redis_codfw.sock: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[18:16:28] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2027']
[18:16:28] <wikibugs>	 (03PS2) 10Daniel Kinzler: Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049
[18:19:20] <icinga-wm>	 PROBLEM - Host logstash2027 is DOWN: PING CRITICAL - Packet loss = 100%
[18:22:53] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2027']
[18:23:18] <wikibugs>	 (03PS1) 10Eevans: echostore: Tighten egress to explit host/port list [deployment-charts] - 10https://gerrit.wikimedia.org/r/868146
[18:23:30] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2027.codfw.wmnet with OS bullseye
[18:27:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:28:04] <wikibugs>	 (03CR) 10Eevans: "Seeing the full list of nodes for RESTBase (our largest cluster) makes me wish that we allocated subnets for clusters like this. 😢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868146 (owner: 10Eevans)
[18:28:36] <wikibugs>	 (03PS2) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561)
[18:29:17] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) (owner: 10Vgutierrez)
[18:30:25] <wikibugs>	 (03PS3) 10Vgutierrez: wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561)
[18:31:58] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] wikimedia.org: Add links.email related DNS records [dns] - 10https://gerrit.wikimedia.org/r/868103 (https://phabricator.wikimedia.org/T188561) (owner: 10Vgutierrez)
[18:32:35] <wikibugs>	 (03PS2) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178)
[18:35:10] <wikibugs>	 (03PS2) 10Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689)
[18:35:16] <icinga-wm>	 PROBLEM - Check systemd state on parse1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:47] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) DNS records are now live: `$ host -t cname links.email.wikimedia.org links.email.wikimedia.org is an alias for recp.mkt41.net.`...
[18:36:28] <wikibugs>	 (03CR) 10Raymond Ndibe: tools-webservice: create /etc/toolforge/webservice.yaml with puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[18:37:00] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2001.codfw.wmnet with reason: host reimage
[18:37:10] <wikibugs>	 (03PS1) 10Effie Mouzeli: cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147
[18:37:45] <wikibugs>	 (03PS3) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178)
[18:38:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 (owner: 10Effie Mouzeli)
[18:39:02] <icinga-wm>	 PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:39:58] <wikibugs>	 (03PS4) 10Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178)
[18:40:13] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2001.codfw.wmnet with reason: host reimage
[18:40:55] <wikibugs>	 (03PS1) 10Southparkfan: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623)
[18:41:42] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38779/console" [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert)
[18:42:41] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2027.codfw.wmnet with reason: host reimage
[18:45:49] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2027.codfw.wmnet with reason: host reimage
[18:48:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[18:49:32] <wikibugs>	 (03PS2) 10Effie Mouzeli: cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147
[18:49:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I like it, I have been confused by "trusted vs protected" myself in the past" [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[18:50:25] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[18:50:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[18:51:00] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[18:52:10] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[18:53:36] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Thank you, @Vgutierrez. I've alerted Acoustic support that the updates have been made and I will follow up here when they pro...
[18:53:55] <wikibugs>	 (03CR) 10Dzahn: "I am not sure, I feel like this might open it up to more changes like this in the future. Once the projects start drifting more we'd repea" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[18:57:50] <duesen>	 thcipriani: I scheduled the fix for the regular backport window in two hours.
[18:59:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yea, merge it but I think it also needs some social contract how to deal with it in the future" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[19:00:05] <jouncebot>	 hashar and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1900).
[19:00:05] <jouncebot>	 hashar and ^demon: Your horoscope predicts another unfortunate MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T1900).
[19:01:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "either way I think we should not keep this alert unless there is coordination between the 3 involved stakeholders, SRE/observability, lega" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[19:02:47] <icinga-wm>	 PROBLEM - Check systemd state on mw1449 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:38] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2027.codfw.wmnet with OS bullseye
[19:11:39] <icinga-wm>	 PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 667 MB (3% inode=91%): /tmp 667 MB (3% inode=91%): /var/tmp 667 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[19:12:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1037.eqiad.wmnet with OS buster
[19:12:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster
[19:14:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap: add stanza for jenkins-ci and jenkins-releases deploy [puppet] - 10https://gerrit.wikimedia.org/r/867294 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn)
[19:15:14] <mutante>	 sigh, I will look at disk space on aphlict1001
[19:15:17] <mutante>	 re: alert above
[19:15:18] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] cloudweb: putting nutcracker.pp back as cloudweb hosts were using it [puppet] - 10https://gerrit.wikimedia.org/r/868147 (owner: 10Effie Mouzeli)
[19:15:23] <RhinosF1>	 mutante: was just about to ping you to look
[19:15:44] <mutante>	 RhinosF1: :) thanks
[19:16:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Cmjohnson) @Jclark-ctr I am getting a media test failure for logstash1037, can you check the cable please  logstash1037 F1 U26 Port 26
[19:16:41] <RhinosF1>	 mutante: see https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad%20prometheus%2Fops&orgId=1&viewPanel=28&from=now-7d&to=now, big spike yesterday
[19:16:47] <RhinosF1>	 Maybe logs again?
[19:17:07] <RhinosF1>	 Or a failed rotate
[19:17:38] <RhinosF1>	 Similar spike 1st/2nd
[19:17:41] <mutante>	 RhinosF1: yes, it is
[19:19:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash1037.eqiad.wmnet with OS buster
[19:19:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1037.eqiad.wmnet with OS buster executed with e...
[19:21:02] <mutante>	 !log aphlict1001 - :/var/log/aphlict# gzip aphlict.log.1
[19:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:32] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Hi Alexandros. The permissions for he.wikisource.org were updated, but not for he.m.wikisource.org. Thanks.
[19:27:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS buster
[19:27:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster
[19:30:52] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2001.codfw.wmnet with OS bullseye
[19:30:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10Cmjohnson) Jclark-ctr I am also getting a media test failure on logstash1036, the DAC cable may be plugged into the wrong port.
[19:32:25] <icinga-wm>	 RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[19:33:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @jclark-ctr Can you try reseating the nic if that is possible
[19:43:25] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:46:10] <wikibugs>	 (03PS1) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[19:46:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[19:54:18] <wikibugs>	 (03PS2) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[19:54:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[19:56:22] <wikibugs>	 (03PS3) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[19:58:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[19:59:55] <wikibugs>	 (03PS1) 10Gergő Tisza: UserEditTracker: Allow querying primary DB for edit timestamp [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051
[20:00:17] <wikibugs>	 (03PS1) 10Gergő Tisza: User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930)
[20:00:35] <wikibugs>	 (03PS4) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:04:30] <wikibugs>	 (03PS3) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625
[20:04:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Add wmcs::openstack::eqiad1::virt_ceph to new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868159 (https://phabricator.wikimedia.org/T313983)
[20:05:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott)
[20:06:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add wmcs::openstack::eqiad1::virt_ceph to new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868159 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott)
[20:06:31] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:11:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host logstash1036.eqiad.wmnet with OS buster
[20:11:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host logstash1036.eqiad.wmnet with OS buster executed with e...
[20:12:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[20:13:55] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[20:14:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye
[20:14:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn...
[20:17:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:59] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1054.eqiad.wmnet with OS bullseye
[20:20:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet w...
[20:29:20] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye
[20:29:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn...
[20:31:29] <wikibugs>	 (03PS5) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:32:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) I enabled virtualization in the bios processor settings for each of these hosts.
[20:33:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye
[20:33:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bullseye
[20:33:36] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bullseye
[20:33:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet with OS bullseye
[20:36:42] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye
[20:36:44] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS bullseye
[20:36:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet with OS bullseye
[20:36:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet with OS bullseye
[20:38:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bullseye
[20:38:52] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bullseye
[20:38:53] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bullseye
[20:38:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1059.eqiad.wmnet with OS bullseye
[20:39:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1060.eqiad.wmnet with OS bullseye
[20:39:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet with OS bullseye
[20:42:00] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[20:42:53] <wikibugs>	 (03PS6) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:45:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[20:46:04] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[20:46:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[20:47:57] <wikibugs>	 (03PS7) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:49:10] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[20:49:28] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage
[20:49:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage
[20:51:39] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[20:51:41] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[20:51:42] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:51:48] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1054.eqiad.wmnet with OS bullseye
[20:51:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet with OS bullseye execut...
[20:51:57] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage
[20:53:35] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1055.eqiad.wmnet with OS bullseye
[20:53:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bullseye execut...
[20:54:02] <wikibugs>	 (03PS1) 10Andrew Bogott: Add hiera host defs for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868166 (https://phabricator.wikimedia.org/T313983)
[20:54:07] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage
[20:54:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add hiera host defs for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/868166 (https://phabricator.wikimedia.org/T313983) (owner: 10Andrew Bogott)
[20:55:19] <wikibugs>	 (03PS8) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:55:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[20:56:05] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye
[20:56:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye
[20:56:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmn...
[20:56:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmn...
[20:56:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:57:29] <wikibugs>	 (03PS9) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[20:58:50] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage
[20:58:51] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder)
[20:59:30] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221214T2100).
[21:00:04] <jouncebot>	 duesen, subbu, kemayo, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:21] <subbu>	 o/
[21:00:33] * TheresNoTime can deploy, are there any self-service patches?
[21:00:42] <Kemayo>	 o/
[21:00:52] <tgr>	 o/ I can self-service, I added more patches than allowed
[21:01:21] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[21:01:30] <TheresNoTime>	 tgr: sure, I'll ping you when ready?
[21:01:51] <TheresNoTime>	 subbu: will start with yours
[21:01:57] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:02:42] <Kemayo>	 Mine do collapse down okay - it's the same patch applied to .13 and .14, and a config patch that makes it actually do something.
[21:02:43] <subbu>	 sure. ty
[21:03:18] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167
[21:05:20] <wikibugs>	 (03PS10) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:05:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[21:06:04] <wikibugs>	 (03PS2) 10Andrew Bogott: OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167
[21:06:58] <wikibugs>	 (03PS11) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:07:32] <wikibugs>	 (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/868167/38790/" [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott)
[21:07:47] <duesen>	 subbu: I'm helping my daughter with homework (learning sql, yay). So I'm around if anything comes up.
[21:08:00] <subbu>	 sounds good! :-)
[21:09:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[21:09:52] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[21:12:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[21:14:03] <TheresNoTime>	 subbu: your patch failed https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/11930/console#console-section-16 
[21:14:25] <subbu>	 Yes, i saw .. can you retry?
[21:14:51] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[21:15:04] <subbu>	 I am not sure why selenium tests would fail here .. i expect it is something transient.
[21:15:18] <TheresNoTime>	 subbu: ack will do
[21:15:21] <subbu>	 ty
[21:16:20] <TheresNoTime>	 subbu: if it's just that, worth forcing it through with a V+2?
[21:16:24] <TheresNoTime>	 (I think that works?)
[21:16:59] <RhinosF1>	 TheresNoTime: why not recheck?
[21:17:01] <subbu>	 no, let us retry .. just in case.
[21:17:10] <TheresNoTime>	 sure :)
[21:17:38] <logmsgbot>	 !log samtar@deploy1002 backport aborted:  (duration: 15m 35s)
[21:18:18] <wikibugs>	 (03PS12) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:18:21] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy, retry" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:18:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[21:19:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:19:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1056.eqiad.wmnet with OS bullseye
[21:19:25] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:19:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet w...
[21:19:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet w...
[21:19:34] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1057.eqiad.wmnet with OS bullseye
[21:19:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet w...
[21:19:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1057.eqiad.wmnet w...
[21:19:50] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "recheck" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:20:23] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:20:25] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:20:27] <RhinosF1>	 subbu: it was a growth experiments test that failed
[21:20:36] <wikibugs>	 (03PS13) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:21:24] <RhinosF1>	 tgr: in case it fails again, might be worth you looking ^ or do you have a flappy test?
[21:22:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:03] <TheresNoTime>	 Kemayo: while that is retrying, I can do yours if you're available
[21:22:11] <Kemayo>	 TheresNoTime: Sure thing
[21:22:48] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS bullseye
[21:22:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet w...
[21:23:03] <RhinosF1>	 Error: element (".oo-ui-messageDialog-message") still not displayed after 5000ms
[21:23:07] <tgr>	 All selenium tests are flappy. Those weren't GrowthExperiments tests though.
[21:23:28] <RhinosF1>	 tgr: /workspace/src/extensions/GrowthExperiments/node_modules/webdriverio/build/commands/browser/waitUntil.js:66:23 is the path it gave
[21:23:36] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS bullseye
[21:23:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet w...
[21:23:49] <RhinosF1>	 It failed on the Growth Experiments step
[21:24:09] <RhinosF1>	 And all flappy sounds fun
[21:24:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bullseye
[21:24:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1059.eqiad.wmnet w...
[21:26:05] <wikibugs>	 10SRE, 10Product-Infrastructure-Team-Backlog, 10Security: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813 (10LSobanski)
[21:26:41] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bullseye
[21:27:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 04-1] "Not to be merged just an example." [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[21:27:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1060.eqiad.wmnet w...
[21:29:37] <TheresNoTime>	 Kemayo: oh maybe not, it's queued behind that core patch (: sorry
[21:29:52] <Kemayo>	 🥲
[21:32:02] * TheresNoTime picked a great window to record with https://github.com/faressoft/terminalizer for docs (:
[21:35:40] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bullseye
[21:35:53] <icinga-wm>	 PROBLEM - Check systemd state on mw1418 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet w...
[21:36:55] <tgr>	 This is the test that failed (VE toolbar special characters button):
[21:36:58] <tgr>	 https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/11930/artifact/log/Toolbar-should-open-special-characters-menu-2022-12-14T21-11-48-713Z.mp4
[21:37:11] <tgr>	 the recording is not very informative though
[21:37:51] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:38:40] <subbu>	 Given that Daniel's patch is supposed to improve VE latencies ... that failure may potentially be pertinent ... but I expect it is just a flappy failure in reality.
[21:39:00] <tgr>	 it doesn't look like opening the special characters menu does a backend request, so it's probably unrelated
[21:39:15] <tgr>	 (but yeah VE failing on a VE fix patch is a bit scary)
[21:39:19] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bullseye
[21:39:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet w...
[21:39:30] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1054 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:40:06] <subbu>	 tgr but this is a cherry pick ... the original patch merged ... so there is also that.
[21:40:09] <tgr>	 failed selenium tests are repeated once, so it shouldn't be *that* flappy
[21:40:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) @jbond I've uploaded a separate patch (above) that makes a stab at working this closer to how we discussed earlier.  It defi...
[21:40:21] <subbu>	 ok
[21:40:39] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1058 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:40:56] <tgr>	 well, I guess it's easy enough to check that button on mwdebug
[21:41:32] <wikibugs>	 (03Merged) 10jenkins-bot: Parsoid: don't bypass ParserCache when using Title [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:41:35] <wikibugs>	 (03Merged) 10jenkins-bot: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:41:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868049 (owner: 10Daniel Kinzler)
[21:41:39] <subbu>	 well, it merged i think. :)
[21:41:50] * subbu was watching zuul
[21:42:06] * duesen curses at Selenium
[21:42:17] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]]
[21:44:05] <logmsgbot>	 !log samtar@deploy1002 samtar and daniel: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[21:44:10] <TheresNoTime>	 subbu: live on mwdebug ^
[21:44:22] <subbu>	 ok .. will test.
[21:45:10] <tgr>	 let me check the GrowthExperiments test (that failed once, passed on auto-retry)
[21:46:43] <subbu>	 VE is still functional .. so looks good to continue.
[21:47:09] <tgr>	 eh, I don't think GE image recommendations are set up on any wmf.14 wiki
[21:47:14] <TheresNoTime>	 subbu: ack, tgr are you testing something or is that separate from this
[21:47:23] <tgr>	 wanted to, but can't
[21:47:32] <TheresNoTime>	 okay, will sync
[21:47:36] <tgr>	 will double-check tomorrow just in case
[21:48:08] <tgr>	 (but the test passed 3 out of times so it's very likely the usual selenium timing thing)
[21:48:19] <tgr>	 ...out of 4...
[21:48:38] <subbu>	 Ack
[21:48:39] <wikibugs>	 (03Merged) 10jenkins-bot: VisualEnhancements: in some languages put an arrow by the reply button [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:51:41] <TheresNoTime>	 Kemayo: after this I'll roll 619/620 into one deploy and then do the config patch — sound okay?
[21:51:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Andrew) 05Open→03Resolved I've imaged all these servers and put 1054 in the 'ceph' pool and the others in the 'spare'...
[21:51:51] <Kemayo>	 Sounds good to me.
[21:53:30] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:868049|Parsoid: don't bypass ParserCache when using Title]] (duration: 11m 13s)
[21:53:33] <TheresNoTime>	 subbu: live in prod
[21:53:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/867619 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:53:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/867620 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:53:50] <subbu>	 ty! duesen ^ if you want to test your hebrew skills. :)
[21:54:14] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]]
[21:54:17] <stashbot>	 T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537
[21:55:04] <TheresNoTime>	 Kemayo: those two are live on mwdebug
[21:55:32] <Kemayo>	 But not the config patch yet, right?
[21:55:48] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) Thanks @Vgutierrez !
[21:56:00] <TheresNoTime>	 yes true, so can't be tested yet I suppose?
[21:56:02] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:56:59] <Kemayo>	 TheresNoTime: I've verified that it's somewhat working via `uselang` elsewhere, so it's looking promising.
[21:57:11] <TheresNoTime>	 will sync :)
[21:57:32] <wikibugs>	 (03PS5) 10Samtar: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[21:59:29] * subbu is signing off for a bit and back online in about 15 mins.
[21:59:34] <TheresNoTime>	 o/
[22:03:09] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:867619|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]], [[gerrit:867620|VisualEnhancements: in some languages put an arrow by the reply button (T323537)]] (duration: 08m 55s)
[22:03:13] <stashbot>	 T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537
[22:03:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[22:04:08] <wikibugs>	 (03Merged) 10jenkins-bot: Deployment of DiscussionTools reply visual enhancements for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867311 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[22:04:38] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]]
[22:06:23] <logmsgbot>	 !log samtar@deploy1002 samtar and kemayo: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:06:25] <TheresNoTime>	 Kemayo: okay, config patch is on mwdebug
[22:06:55] <Kemayo>	 TheresNoTime: Looks good!
[22:07:03] <TheresNoTime>	 cool, syncing :)
[22:12:51] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:867311|Deployment of DiscussionTools reply visual enhancements for more wikis (T323537)]] (duration: 08m 12s)
[22:12:54] <TheresNoTime>	 Kemayo: live in prod
[22:12:55] <stashbot>	 T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537
[22:13:02] <TheresNoTime>	 tgr: all yours if you still want to 
[22:13:16] <tgr>	 thanks!
[22:13:25] <Kemayo>	 TheresNoTime: Thanks!
[22:13:36] <subbu>	 duesen, the patch may have done the trick ... the latency is now in the much lower range ... will probably be clearer by tomorrow.
[22:15:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10bd808) Related: {T148048}
[22:16:23] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041) (owner: 10Kosta Harlan)
[22:16:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051 (owner: 10Gergő Tisza)
[22:24:04] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T286164)
[22:24:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:28:45] <wikibugs>	 (03PS2) 10BryanDavis: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681)
[22:32:39] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert1001.wikimedia.org
[22:36:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: NFS troubleshooting
[22:36:05] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on wdqs2009.codfw.wmnet with reason: NFS troubleshooting
[22:36:54] <wikibugs>	 (03Merged) 10jenkins-bot: NewImpact: Add log event for clicking suggested edits button [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868047 (https://phabricator.wikimedia.org/T325041) (owner: 10Kosta Harlan)
[22:36:59] <wikibugs>	 (03Merged) 10jenkins-bot: UserEditTracker: Allow querying primary DB for edit timestamp [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868051 (owner: 10Gergő Tisza)
[22:37:26] <logmsgbot>	 !log tgr@deploy1002 Started scap: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]]
[22:37:31] <stashbot>	 T325041: Bring NewImpact logging on par with old Impact - https://phabricator.wikimedia.org/T325041
[22:39:14] <logmsgbot>	 !log tgr@deploy1002 tgr and kharlan and tgr: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[22:43:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10mpopov) I just updated @Fuzzy's permissions for he.m.wikisource. Used to be 'Restricted' but 'Full' now (same as he.wikisource).  >...
[22:46:22] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host alert1001.wikimedia.org
[22:47:14] <denisse>	 ^ it was expected that the last test fails as it connects to external services.
[22:47:20] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:47:22] <denisse>	 I'm already looking at the hosts health.
[22:48:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[22:48:50] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.203 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:49:03] <logmsgbot>	 !log tgr@deploy1002 Finished scap: Backport for [[gerrit:868047|NewImpact: Add log event for clicking suggested edits button (T325041)]], [[gerrit:868051|UserEditTracker: Allow querying primary DB for edit timestamp]] (duration: 11m 37s)
[22:49:07] <stashbot>	 T325041: Bring NewImpact logging on par with old Impact - https://phabricator.wikimedia.org/T325041
[22:53:17] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681) (owner: 10BryanDavis)
[22:55:04] <icinga-wm>	 PROBLEM - Check systemd state on mw2374 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:20] <tgr>	 !log doing the last backport by hand due to T325252
[22:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:24] <stashbot>	 T325252: scap backport fails with "Multiple changes found" - https://phabricator.wikimedia.org/T325252
[22:57:54] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:58:34] <wikibugs>	 (Merged) jenkins-bot: toolhub: bump container to 2022-12-14-185830-production [deployment-charts] - https://gerrit.wikimedia.org/r/868183 (https://phabricator.wikimedia.org/T195681) (owner: BryanDavis)
[22:58:48] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:59:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[23:00:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:01:02] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[23:03:10] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[23:03:17] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert2001.wikimedia.org
[23:03:18] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2001.wikimedia.org
[23:03:42] <wikibugs>	 (CR) Gergő Tisza: [C: +2] User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930) (owner: Gergő Tisza)
[23:04:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[23:08:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[23:10:41] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on alert2001.wikimedia.org with reason: kernel update
[23:10:42] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[23:10:42] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on alert2001.wikimedia.org with reason: kernel update
[23:12:13] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[23:12:44] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:15] <bd808>	 !log Toolhub: rebuilding search indices following app update
[23:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:48] <wikibugs>	 (PS1) Bking: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114)
[23:17:35] <wikibugs>	 (PS1) Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375)
[23:18:54] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[23:20:24] <wikibugs>	 (CR) Herron: [C: +1] librenms: Increase the TTL for LibreNMS [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[23:20:51] <wikibugs>	 (Merged) jenkins-bot: User impact: read edit count from primary db in save complete hook [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868052 (https://phabricator.wikimedia.org/T324930) (owner: Gergő Tisza)
[23:21:52] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:22:04] <icinga-wm>	 PROBLEM - Host wdqs2009 is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:41] <wikibugs>	 (PS2) Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375)
[23:24:22] <icinga-wm>	 RECOVERY - Host wdqs2009 is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms
[23:24:40] <wikibugs>	 (PS4) Andrea Denisse: librenms: Increase the TTL for LibreNMS [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695)
[23:24:54] <wikibugs>	 (PS3) Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375)
[23:25:37] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[23:25:47] <wikibugs>	 (CR) Dzahn: "I don't think it hurts but in other cases we have also just kept it at 5M permanently in case we need to switch and afaict there is no dow" [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[23:27:31] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2011.*
[23:27:37] <wikibugs>	 (CR) Herron: [C: +1] "LGTM 🕳" [puppet] - https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: Cwhite)
[23:27:37] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2012.*
[23:28:29] <ryankemper>	 !log T301167 wdqs2011/2012 were not visible in pybal (oversight from when I added the other hosts with conftool last week). Fixed that, so now all of the new hosts are showing up properly.
[23:28:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:33] <stashbot>	 T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167
[23:29:40] <ryankemper>	 !log [WDQS] Downtimed wdqs20[09-12] for the next 7 days
[23:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:44] <wikibugs>	 (CR) Andrea Denisse: [C: +2] librenms: Increase the TTL for LibreNMS [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[23:29:58] <wikibugs>	 (CR) Andrea Denisse: [C: +2] librenms: Increase the TTL for LibreNMS (1 comment) [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[23:31:14] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:31:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:32:48] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:33:00] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:33:16] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2026']
[23:33:51] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/868199/38794/deploy2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: Dzahn)
[23:34:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect, ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:34:47] <wikibugs>	 (PS4) Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375)
[23:35:04] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash2026 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8165543390: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[23:35:04] <icinga-wm>	 org/wiki/Search%23Administration
[23:36:00] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: Cwhite)
[23:40:57] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2026']
[23:41:30] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2026']
[23:44:46] <icinga-wm>	 PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100%
[23:46:40] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:46:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:12] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2026']
[23:48:36] <icinga-wm>	 RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms
[23:48:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:48:56] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash2026 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 662, active_shards: 1465, relocating_shards: 6, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[23:48:56] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:50:33] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye
[23:53:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:53:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:55:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:57:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:59:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status