[00:10:03] You need to get someone with permission to do it [00:10:09] And/or get yourself added to the allow list [00:11:57] (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136) (owner: 104nn1l2) [00:12:40] (03PS3) 10Reedy: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) (owner: 10Inductiveload) [00:16:49] Thanks, how can I get myself added to the allow list? [00:20:04] https://www.mediawiki.org/wiki/Continuous_integration/Allow_list [00:20:26] tl;dr convince someone that you aren't malicious and should be added to https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/zuul/layout.yaml [00:25:32] Thanks, I'm an admin and interface admin on Commons: https://commons.wikimedia.org/wiki/User:4nn1l2 Been around about 10 years. Here is a list of my previous commits: https://phabricator.wikimedia.org/people/commits/4285/ Could someone please add me to the list? [00:50:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:01] Here is the associated patch: https://gerrit.wikimedia.org/r/c/integration/config/+/741985 Should I schedule it for a backport window or sth? 
[04:30:45] PROBLEM - Check systemd state on db1115 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:19] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:41:07] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:44:31] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:54:21] (03PS1) 10Marostegui: control-mariadb-client-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/741997 (https://phabricator.wikimedia.org/T295965) [05:55:32] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/741997 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [05:56:03] (03Merged) 10jenkins-bot: control-mariadb-client-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/741997 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [06:19:45] !log killing lingering process from mwmaint to depooled db (db1160) that was depooled nine hours ago [06:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:11] !log killing extensions/MachineVision/maintenance/fetchSuggestions.php in mwmaint [06:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:10] Created T296507 [06:35:11] T296507: fetchSuggestions opens connection to depooled database after nine hours - https://phabricator.wikimedia.org/T296507 [07:13:59] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:16:11] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:34:11] (03CR) 10Elukey: "Left some ideas/comments, let me know your thoughts John!" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [07:43:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17873 and previous config saved to /var/cache/conftool/dbconfig/20211126-074320-ladsgroup.json [07:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:25] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [07:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17874 and previous config saved to /var/cache/conftool/dbconfig/20211126-075824-ladsgroup.json [07:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:29] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211126T0800) [08:13:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17875 and previous config saved to /var/cache/conftool/dbconfig/20211126-081329-ladsgroup.json [08:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:34] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:28:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17876 and previous config saved to /var/cache/conftool/dbconfig/20211126-082834-ladsgroup.json [08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:39] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:50:03] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10Rosalie_WMDE) @Jelto The document has been signed [08:50:19] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10Rosalie_WMDE) [08:53:19] 10SRE, 10Data-Persistence, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10Marostegui) [09:06:40] (03PS1) 10Majavah: devtools: set doc1002 to use local puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/742078 [09:08:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, minor pedantic comment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [09:13:33] (03PS7) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) [09:15:28] (03CR) 10Majavah: "Tested on "devtools" cloud vps project. Works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [09:23:35] (03PS4) 10David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) [09:23:52] (03PS5) 10David Caro: cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) [09:24:55] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [09:35:24] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [09:37:13] (03PS6) 10David Caro: cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) [09:44:06] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [09:50:12] (03PS11) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [09:50:50] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 
10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [09:51:10] (03PS4) 10David Caro: timesyncd: handle bullseye ntp hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [09:51:48] (03PS12) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [09:52:00] (03PS5) 10David Caro: timesyncd: handle bullseye ntp hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [09:52:24] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [09:52:28] (03PS6) 10David Caro: timesyncd: handle bullseye ntp hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [09:54:26] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32656/console" [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [09:55:10] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32657/console" [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [09:55:34] (03CR) 10David Caro: [V: 03+1 C: 03+2] timesyncd: handle bullseye ntp hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [09:55:45] (03CR) 10David Caro: [V: 03+1 C: 03+2] timesyncd: handle bullseye ntp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [09:58:33] PROBLEM - Ensure traffic_exporter for the 
tls instance binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:04:06] (03PS1) 10David Caro: timsyncd: Flip the handling service condition [puppet] - 10https://gerrit.wikimedia.org/r/742107 (https://phabricator.wikimedia.org/T296456) [10:04:28] (03CR) 10David Caro: [C: 03+2] timsyncd: Flip the handling service condition [puppet] - 10https://gerrit.wikimedia.org/r/742107 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [10:04:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T296143 [10:04:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T296143 [10:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:56] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:26] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:05:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance T296274 [10:05:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance T296274 [10:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:45] T296274: Clean up wikiadmin GRANTs mess - https://phabricator.wikimedia.org/T296274 [10:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T296274)', diff saved to https://phabricator.wikimedia.org/P17877 and previous config saved to 
/var/cache/conftool/dbconfig/20211126-100547-ladsgroup.json [10:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:36] (03PS13) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:08:11] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:09:06] (03PS1) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [10:09:25] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23694 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:10:26] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:13:32] (03PS14) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:14:09] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool after fixing users T296274', diff saved to https://phabricator.wikimedia.org/P17878 and previous config saved to /var/cache/conftool/dbconfig/20211126-101423-ladsgroup.json [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:29] T296274: Clean up wikiadmin GRANTs mess - 
https://phabricator.wikimedia.org/T296274 [10:17:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance T296274 [10:17:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance T296274 [10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T296274)', diff saved to https://phabricator.wikimedia.org/P17879 and previous config saved to /var/cache/conftool/dbconfig/20211126-101714-ladsgroup.json [10:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:25] (03PS15) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:20:59] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:23:06] (03PS16) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:23:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool after fixing users T296274', diff saved to https://phabricator.wikimedia.org/P17880 and previous config saved to /var/cache/conftool/dbconfig/20211126-102340-ladsgroup.json [10:23:41] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:45] T296274: Clean up wikiadmin GRANTs mess - 
https://phabricator.wikimedia.org/T296274 [10:23:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32661/console" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:26:37] (03PS1) 10David Caro: tests: move to pytest [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/742112 (https://phabricator.wikimedia.org/T296481) [10:28:38] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) I went the "set a different sampling pipeline for internal flows" way with the above POC for the reasons mentioned in T263... [10:33:11] (03PS17) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:33:26] (03CR) 10Jbond: P:base::certificates: update support for trusted CA (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:34:57] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:35:51] (03CR) 10Jbond: P:base::certificates: update support for trusted CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:37:26] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:37:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:42:26] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:47:26] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:54:02] (03CR) 10Jbond: "looks good and my local tests pass" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:56:49] (03PS18) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [10:59:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/742112 (https://phabricator.wikimedia.org/T296481) (owner: 10David Caro) [11:00:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:01:26] atlas_exporter monitoring is flapping on and off quite frequently lately [11:04:26] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:09:26] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:13:29] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [11:15:45] (03PS1) 10Filippo Giunchedi: pontoon: use profile::base on puppet master [puppet] - 10https://gerrit.wikimedia.org/r/742121 [11:15:49] (03CR) 10Jbond: cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [11:18:07] (03Abandoned) 10Filippo Giunchedi: pontoon: tmp remove base::puppet for duplicate declaration? 
[puppet] - 10https://gerrit.wikimedia.org/r/740595 (owner: 10Filippo Giunchedi) [11:19:16] (03PS1) 10Vgutierrez: cache::haproxy: Set stat socket privileve level to admin [puppet] - 10https://gerrit.wikimedia.org/r/742122 (https://phabricator.wikimedia.org/T290005) [11:20:28] (03PS2) 10Vgutierrez: cache::haproxy: Set stat socket privilege level to admin [puppet] - 10https://gerrit.wikimedia.org/r/742122 (https://phabricator.wikimedia.org/T290005) [11:24:21] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Set stat socket privilege level to admin [puppet] - 10https://gerrit.wikimedia.org/r/742122 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:25:50] !log restarting HAProxy on O:cache::(text|upload)_haproxy - T290005 [11:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:55] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:32:26] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:39:15] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:10] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:41:12] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:17] !log T296303 cleanup weird state of calico-codfw cluster [11:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:20] T296303: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 [11:47:26] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:56:34] (03CR) 10David Caro: "I got a question and a few nits (feel free to ignore those)" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [12:06:45] (03PS1) 10Vgutierrez: cache::haproxy: Relax HTTP parsing rules [puppet] - 10https://gerrit.wikimedia.org/r/742128 (https://phabricator.wikimedia.org/T290005) [12:09:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32662/console" [puppet] - 10https://gerrit.wikimedia.org/r/742128 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:12:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Relax HTTP parsing rules [puppet] - 10https://gerrit.wikimedia.org/r/742128 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:19:14] 10SRE, 10DBA, 10Privacy Engineering, 10WMF-Legal, and 3 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Marostegui) 05Stalled→03Declined We are going to deprecate tendril in favour of orchestrator, we've already opened it for people under... 
[12:21:33] !log restarting HAProxy on O:cache::upload_haproxy - T290005 [12:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:23:39] (03PS1) 10Jbond: P:puppet_compiler::postgres_database: create ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/742130 [12:24:01] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppet_compiler::postgres_database: create ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/742130 (owner: 10Jbond) [12:40:37] (03PS1) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [12:42:10] (03CR) 10jerkins-bot: [V: 04-1] ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:43:11] (03PS1) 10Arturo Borrero Gonzalez: hieradata: ceph: refresh bootstrap auth [labs/private] - 10https://gerrit.wikimedia.org/r/742133 (https://phabricator.wikimedia.org/T293752) [12:45:26] (03PS1) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [12:45:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32663/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [12:46:02] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [12:50:10] (03PS2) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [12:54:44] (03CR) 10Majavah: [C: 03+1] "Let's give this a try at some 
point!" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [12:58:44] (03PS2) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [12:59:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32664/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [12:59:21] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [12:59:35] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:02:48] (03PS3) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:03:22] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:04:24] (03PS4) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:05:09] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:05:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32666/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:08:55] (03PS1) 10Kormat: .gitignore: Ignore __pycache__ dirs. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/742140 [13:11:31] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use profile::base on puppet master [puppet] - 10https://gerrit.wikimedia.org/r/742121 (owner: 10Filippo Giunchedi) [13:13:34] (03PS5) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:14:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32667/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:14:09] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:16:24] (03PS6) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:16:59] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:17:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32669/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:24:37] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [13:25:03] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:50] (03PS4) 10Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 [13:25:52] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:47] (03PS3) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [13:29:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:29:55] (03CR) 10David Caro: ceph: move bootstrap keyring into new auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:31:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:33:15] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:33:54] (03PS4) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [13:35:37] (03CR) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:35:49] 10SRE, 10Analytics, 10Observability-Metrics: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (10fgiunchedi) [13:36:16] 10SRE, 10Observability-Logging, 10User-ema: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10fgiunchedi) [13:36:29] 10SRE, 10Observability-Logging, 10User-ema: rsyslog error: queue directory '/var/spool/rsyslog' and file name 
prefix 'output_kafka_json' already used - https://phabricator.wikimedia.org/T292180 (10fgiunchedi) [13:37:47] (03PS7) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:38:28] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:38:56] (03CR) 10Jelto: [C: 03+2] helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto) [13:40:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10fgiunchedi) [13:41:09] (03PS8) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:41:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:41:44] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:42:28] (03Merged) 10jenkins-bot: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto) [13:43:29] (03PS9) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:44:00] (03PS10) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:44:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32672/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:44:35] (03CR) 10jerkins-bot: 
[V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:46:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32673/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:46:29] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [13:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:38] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [13:48:39] (03CR) 10David Caro: ceph: move bootstrap keyring into new auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:55] 10Puppet, 10SRE, 10Infrastructure-Foundations: Role hieradata for non-existent roles - https://phabricator.wikimedia.org/T296533 (10Majavah) [13:52:35] (03PS11) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:52:59] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Ladsgroup) I ran it in the cloud. So far everything looks good. 
[13:53:10] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:55:43] (03PS12) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:56:20] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:57:17] (03PS13) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [13:57:26] (03PS5) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [13:57:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32677/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:57:54] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [13:58:37] (03CR) 10David Caro: [C: 03+1] hieradata: ceph: refresh bootstrap auth [labs/private] - 10https://gerrit.wikimedia.org/r/742133 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:58:57] (03PS2) 10Arturo Borrero Gonzalez: hieradata: ceph: refresh bootstrap auth [labs/private] - 10https://gerrit.wikimedia.org/r/742133 (https://phabricator.wikimedia.org/T293752) [14:00:24] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Ladsgroup) Added a massive text to auto response and it worked fine meaning the schema change fixes the issue. I think we can move forwar... 
[14:02:54] (03PS14) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [14:03:29] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [14:03:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32678/console" [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [14:06:25] (03PS15) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [14:07:58] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [14:08:23] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) Sounds good to me, I can help with the deployment :) [14:09:31] (03PS16) 10Jbond: P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 [14:10:57] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [14:11:04] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler::postgres_database: pass config via hiera [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [14:14:02] (03PS17) 10Jbond: P:puppet_compiler: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/742136 [14:15:51] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: Refactor [puppet] - 10https://gerrit.wikimedia.org/r/742136 (owner: 10Jbond) [14:15:54] 10ops-codfw: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540 (10fgiunchedi) [14:21:20] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[14:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:35] (03PS1) 10Jbond: puppet_compiler: mkdir_p workdir not vardir [puppet] - 10https://gerrit.wikimedia.org/r/742146 [14:23:32] (03CR) 10Jbond: [C: 03+2] puppet_compiler: mkdir_p workdir not vardir [puppet] - 10https://gerrit.wikimedia.org/r/742146 (owner: 10Jbond) [14:25:41] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [14:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] (03PS1) 10Filippo Giunchedi: logstash: log receiver and instance alert labels [puppet] - 10https://gerrit.wikimedia.org/r/742147 [14:30:24] (03CR) 10jerkins-bot: [V: 04-1] logstash: log receiver and instance alert labels [puppet] - 10https://gerrit.wikimedia.org/r/742147 (owner: 10Filippo Giunchedi) [14:30:32] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . [14:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:54] (03PS1) 10Jbond: puppet_compiler: dont use mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/742149 [14:37:31] (03PS2) 10Jbond: puppet_compiler: dont use mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/742149 [14:38:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32681/console" [puppet] - 10https://gerrit.wikimedia.org/r/742149 (owner: 10Jbond) [14:39:15] (03PS2) 10Filippo Giunchedi: logstash: log additional alert labels [puppet] - 10https://gerrit.wikimedia.org/r/742147 [14:39:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet_compiler: dont use mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/742149 (owner: 10Jbond) [14:40:22] (03CR) 10Awight: "I'm not sure how to run the image or the varnishtest, so the patch was made blindly." 
[puppet] - 10https://gerrit.wikimedia.org/r/742148 (https://phabricator.wikimedia.org/T296512) (owner: 10Awight) [15:00:31] (03PS1) 10MMandere: admin: Add user rosalie-wmde to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/742152 (https://phabricator.wikimedia.org/T295765) [15:03:15] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:06:14] 10ops-eqiad, 10DC-Ops: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) [15:06:30] (03PS1) 10Jcrespo: mariadb: Reduce memory allocation for dbs at db1102 due to hw failure [puppet] - 10https://gerrit.wikimedia.org/r/742153 (https://phabricator.wikimedia.org/T296546) [15:07:58] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups, 10Patch-For-Review: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) [15:08:15] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reduce memory allocation for dbs at db1102 due to hw failure [puppet] - 10https://gerrit.wikimedia.org/r/742153 (https://phabricator.wikimedia.org/T296546) (owner: 10Jcrespo) [15:15:41] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.24 ms [15:17:23] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:42] (03PS1) 10Jbond: puppet_compiler: additional volume is not ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/742158 [15:33:01] PROBLEM - Disk space on ores1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [15:33:38] (03CR) 10Jbond: [C: 03+2] puppet_compiler: additional volume is not ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/742158 (owner: 10Jbond) [15:40:14] whose handling orest lately? research? machine learning? [15:40:19] *ORES [15:42:14] ML :) [15:42:30] ah snap I see the alert, my highlights for IRC didn't work [15:42:32] sigh [15:42:37] deploy-cache seems the culprit [15:43:19] ah, no, that is not [15:44:00] it seems /var/tmp [15:44:04] yepo [15:44:07] *yep [15:44:40] ah lovely some fresh coredumps [15:44:55] celery indeed segfaulted [15:45:35] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:43] for mysql we disabled cores- a 500GB files wasn't that useful :-) [15:45:55] as in, coredumps, not processor cores :-) [15:46:45] !log move /var/tmp/core/* to /srv/coredumps on ores1008 to free root space [15:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:08] yeah we should put a limit and/or move the target dir on a bigger partition [15:49:46] wow the disks are really slow [15:50:42] (03PS1) 10Jbond: P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 [15:51:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32682/console" [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:51:34] (03CR) 10jerkins-bot: [V: 04-1] P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:53:09] (03PS2) 10Jbond: P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 [15:53:51] (03CR) 10jerkins-bot: [V: 04-1] P:ci::slave::labs::common: 
Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:53:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32683/console" [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:54:07] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [15:55:19] (03PS3) 10Jbond: P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 [15:55:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32684/console" [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:56:37] (03CR) 10jerkins-bot: [V: 04-1] P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [15:58:23] (03PS4) 10Jbond: P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 [15:59:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32685/console" [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [16:00:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:ci::slave::labs::common: Add toggle for lvm managment [puppet] - 10https://gerrit.wikimedia.org/r/742163 (owner: 10Jbond) [16:03:20] (03PS1) 10Jelto: charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 [16:05:53] !log drain kubestage1001 node in prep for decommissioning [16:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:34] (03CR) 10Btullis: "As discussed in the attached ticket, I propose that we abandon this CR 
and make a follow-up ticket to create an inventory of the paging le" [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [16:09:36] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox to 2021-11-26-093451-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) [16:10:52] (03Abandoned) 10Btullis: alerts: add victorops paging for hadoop master and kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [16:11:03] !log drain kubestage1002 node in prep for decommissioning [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] (03CR) 10Lucas Werkmeister (WMDE): "I’m not sure if both files should be updated in the same commit, but it looks like this is what was done in the past, and if I understand " [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [16:20:39] (03PS1) 10Majavah: admin: add .bashrc for taavi [puppet] - 10https://gerrit.wikimedia.org/r/742168 [16:23:22] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress server: fix dependcey loop - https://phabricator.wikimedia.org/T296550 (10jbond) p:05Triage→03Medium [16:24:25] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress server: fix dependcey loop - https://phabricator.wikimedia.org/T296550 (10jbond) [16:24:32] (03CR) 10Jelto: [C: 04-1] gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [16:28:07] (03PS2) 10Majavah: admin: add .bashrc for taavi [puppet] - 10https://gerrit.wikimedia.org/r/742168 [16:37:09] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:24] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I 
have seen and reviewed most of this code before in https://gitlab.com/wmde/wmde-technicalwishes-docker-dev/-/merge_requests/36/diffs and" [puppet] - 10https://gerrit.wikimedia.org/r/742148 (https://phabricator.wikimedia.org/T296512) (owner: 10Awight) [16:53:23] (03PS6) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [16:59:39] (03PS4) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [16:59:56] (03CR) 10Hnowlan: partman: add reuse partman profile for cassandra hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [17:00:00] (03PS5) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [17:06:27] (03PS3) 10Jcrespo: admin: add .bashrc for taavi [puppet] - 10https://gerrit.wikimedia.org/r/742168 (owner: 10Majavah) [17:07:32] (03CR) 10Jcrespo: [C: 03+2] admin: add .bashrc for taavi [puppet] - 10https://gerrit.wikimedia.org/r/742168 (owner: 10Majavah) [17:15:17] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [17:16:35] (03CR) 10Jcrespo: "Is there some obscure puppet functionality I cannot see (eg. 
custom module outside production referencing the btulis files) or is this a m" [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [17:19:47] (03PS1) 10Jcrespo: admin: Fix path of btullis' dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) [17:21:38] (03CR) 10Jcrespo: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/742172/" [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [17:27:25] (03CR) 10Jbond: Add initial personal dotfiles and one script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [17:27:38] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) (owner: 10Jcrespo) [17:30:25] (03CR) 10Jcrespo: [C: 03+1] "I think this should be safe to merge, but as I am about to leave for the weekend, I will let btullis themself merge at their convenience, " [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) (owner: 10Jcrespo) [17:42:58] (03PS5) 10Hnowlan: C:cassandra: add optional java_package variable [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: 10Jbond) [17:44:33] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32687/console" [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: 10Jbond) [17:56:52] (03CR) 10Hnowlan: [V: 03+1] "lgtm, I can merge this on monday" [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: 10Jbond) [18:19:09] (03PS7) 10Arturo Borrero Gonzalez: ceph: move bootstrap keyring into new auth abstraction [puppet] - 
10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) [18:19:11] (03PS1) 10Arturo Borrero Gonzalez: profile: ceph: cleanup firewall config [puppet] - 10https://gerrit.wikimedia.org/r/742174 [18:19:13] (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [18:19:15] (03PS1) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [18:21:03] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:22:32] (03PS2) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [18:24:46] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:44:18] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:54] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:22] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:14:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - 
https://phabricator.wikimedia.org/T295672 (10cmooney) Earlier in the week I attempted to remove the "metric-out minimum-igp" from the iBGP session between cr1-eqi... [19:22:16] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 23674 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:30:00] Does anybody know when approximately Wikimedia will migrate from Gerrit to Gitlab? [19:30:58] "some date that's not in the past" is the best I have [19:31:25] it also depends what you mean by "Wikimedia" and "migrate" [19:32:04] I'm reading Gerrit user manual. I want to know if it's worth of my time. [19:33:21] I expect we'll be using Gerrit for a while. Repositories will slowly migrate to GitLab, as it becomes more stable/usable [20:05:50] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:06:13] I've wondered this as well. I've always found that Gerrit seems to be... just fine, and it surprises me that a move off it and all the tooling we already have for it is worth something. [20:08:19] Although it's probably also because change is bad, everything old is good, everything new is bad, keep things the same, yada yada [20:14:36] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23694 bytes in 0.251 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:19:10] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:33:41] nn1l2: we are in the process of that migration. 
see https://www.mediawiki.org/wiki/GitLab/Roadmap#pioneers [20:34:15] there is a channel for discussion and collaboration on migrating things at #wikimedia-gitlab [20:38:48] Thanks for the link! [21:07:47] (03CR) 10Ssingh: [C: 03+1] "confirmed UID, NDA, L3, SSH key." [puppet] - 10https://gerrit.wikimedia.org/r/742152 (https://phabricator.wikimedia.org/T295765) (owner: 10MMandere) [21:31:54] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [21:34:02] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [22:14:24] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:23:08] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 23681 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:22:40] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:08] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:00] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:37:52] PROBLEM - Host mr1-drmrs.oob IPv6 is 
DOWN: PING CRITICAL - Packet loss = 100% [23:37:52] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:37:58] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:34] PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:26] PROBLEM - Disk space on ores1007 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1007&var-datasource=eqiad+prometheus/ops [23:51:18] PROBLEM - Disk space on ores1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops