[00:39:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931916 [00:39:40] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931916 (owner: 10TrainBranchBot) [01:02:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931916 (owner: 10TrainBranchBot) [01:13:30] (Not accepting/receiving prefixes from anycast BGP peer) resolved: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [01:15:27] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [01:19:06] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [01:35:23] (03CR) 10Eevans: [C: 03+1] hiera: set ms-be2068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932197 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [01:51:48] (03CR) 10Eevans: [C: 03+1] cassandra: add initial support for PKI TLS certs to 4.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] 
(SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:11:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:20:12] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:24:46] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:56:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:01:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:28:06] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:44:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:49:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:57:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118', diff saved to https://phabricator.wikimedia.org/P49472 and previous config saved to /var/cache/conftool/dbconfig/20230623-045758-root.json [05:17:10] (03PS1) 10Marostegui: db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/932356 (https://phabricator.wikimedia.org/T335092) [05:17:38] (03CR) 10Marostegui: [C: 03+2] db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/932356 (https://phabricator.wikimedia.org/T335092) (owner: 10Marostegui) [05:38:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, 
dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:45:55] (03PS1) 10Alexandros Kosiaris: helmfile.d: Add wikifunctions stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) [05:46:02] (03PS2) 10Jameel Kaisar: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) [05:49:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 14860 [05:50:18] (03PS1) 10Alexandros Kosiaris: deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) [05:50:42] (03CR) 10CI reject: [V: 04-1] deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [05:52:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14860 [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230623T0600) [06:01:55] (03PS2) 10Alexandros Kosiaris: deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) [06:14:22] (03PS5) 10Muehlenhoff: Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) [06:23:04] (03PS1) 10Elukey: turnilo: add https field to webrequest_sampled_live datacube [puppet] - 10https://gerrit.wikimedia.org/r/932361 (https://phabricator.wikimedia.org/T340097) [06:29:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [06:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:34:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [06:35:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:42:11] (03PS1) 10Slyngshede: netbox:standalone Switch back to CAS [puppet] - 10https://gerrit.wikimedia.org/r/932362 [06:43:55] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:44:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41942/console" [puppet] - 10https://gerrit.wikimedia.org/r/932362 (owner: 10Slyngshede) [06:44:48] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] netbox:standalone Switch back to CAS [puppet] - 10https://gerrit.wikimedia.org/r/932362 (owner: 10Slyngshede) 
[06:50:41] (03PS1) 10Slyngshede: netbox:standalone: Move to production IDP [puppet] - 10https://gerrit.wikimedia.org/r/932363 [06:51:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41943/console" [puppet] - 10https://gerrit.wikimedia.org/r/932363 (owner: 10Slyngshede) [06:52:53] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] netbox:standalone: Move to production IDP [puppet] - 10https://gerrit.wikimedia.org/r/932363 (owner: 10Slyngshede) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230623T0700) [07:01:35] (03PS1) 10Muehlenhoff: Update access for Jennifer [puppet] - 10https://gerrit.wikimedia.org/r/932364 [07:17:45] (03CR) 10Muehlenhoff: [C: 03+2] Update access for Jennifer [puppet] - 10https://gerrit.wikimedia.org/r/932364 (owner: 10Muehlenhoff) [07:25:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:26:03] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:53:50] (03PS1) 10Slyngshede: netbox:standalone: Switch to idp-test [puppet] - 10https://gerrit.wikimedia.org/r/932373 [07:55:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41944/console" [puppet] - 10https://gerrit.wikimedia.org/r/932373 (owner: 10Slyngshede) [07:57:57] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] netbox:standalone: Switch to idp-test [puppet] - 10https://gerrit.wikimedia.org/r/932373 (owner: 10Slyngshede) [08:08:17] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:09:55] (03PS7) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [08:18:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [08:22:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:22:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:25:08] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:25:17] (03CR) 10Elukey: [C: 03+2] role::ml_cache::storage: use pki truststore [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:33:17] (03PS12) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [08:37:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1110.eqiad.wmnet [08:40:59] (03CR) 10Elukey: [C: 03+2] role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [08:41:54] (03PS2) 10Elukey: Move drmrs Varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) [08:46:06] (03CR) 10Vgutierrez: [C: 03+1] Move drmrs Varnishkafka instances to 
PKI [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:48:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-cache1001.eqiad.wmnet with reason: Working on pki [08:48:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-cache1001.eqiad.wmnet with reason: Working on pki [08:51:17] (03PS1) 10Isabelle Hurbain-Palatin: Enable parsoid support for Kartographer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) [08:57:59] PROBLEM - SSH on an-worker1110 is CRITICAL: connect to address 10.64.36.142 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:11] (03CR) 10Clément Goubert: [C: 03+1] Enable parsoid support for Kartographer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [09:08:44] (03CR) 10Hashar: [C: 03+2] "Thanks Clément!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [09:09:29] (03Merged) 10jenkins-bot: Enable parsoid support for Kartographer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [09:09:40] I am deploying that config change which is solely for the beta cluster [09:10:56] done with `scap backport 932376` which did not sync anything since that is only for beta: [09:10:56] 09:10:11 Skipping sync since all commits were beta/labs-only changes. Operation completed. 
[09:11:29] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:11:46] (03PS8) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [09:13:06] (03PS1) 10Elukey: profile::cassandra: add hiera option for the TLS keystore password [puppet] - 10https://gerrit.wikimedia.org/r/932378 (https://phabricator.wikimedia.org/T288470) [09:13:57] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:15:36] (03PS1) 10Elukey: role::ml_cache::storage: add fake TLS keystore password for PKI [labs/private] - 10https://gerrit.wikimedia.org/r/932379 (https://phabricator.wikimedia.org/T288470) [09:15:53] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::ml_cache::storage: add fake TLS keystore password for PKI [labs/private] - 10https://gerrit.wikimedia.org/r/932379 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:15:55] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:16:57] (03CR) 10Clément Goubert: "Hi Jaime," [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [09:17:03] (03PS1) 10Fabfur: [beta] Update wgCdnServersNoPurge for new cache server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932380 (https://phabricator.wikimedia.org/T327742) [09:18:05] PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:18:13] PROBLEM - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:18:33] PROBLEM - Check systemd state on ml-cache1001 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:52] this is me testing --^ [09:18:55] downtime expired [09:19:25] PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:19:28] (03CR) 10Lucas Werkmeister (WMDE): Enable parsoid support for Kartographer on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [09:20:57] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1110.eqiad.wmnet [09:21:23] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:21:40] (03PS1) 10Elukey: Revert "role::ml_cache::storage: add fake TLS keystore password for PKI" [labs/private] - 10https://gerrit.wikimedia.org/r/932274 [09:21:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] Revert "role::ml_cache::storage: add fake TLS keystore password for PKI" [labs/private] - 10https://gerrit.wikimedia.org/r/932274 (owner: 10Elukey) [09:23:19] (03PS2) 10Elukey: profile::cassandra: add hiera option for the TLS keystore password [puppet] - 10https://gerrit.wikimedia.org/r/932378 (https://phabricator.wikimedia.org/T288470) [09:23:21] (03PS1) 10Elukey: cassandra: allow to set the keystore password for 4.x [puppet] - 10https://gerrit.wikimedia.org/r/932381 (https://phabricator.wikimedia.org/T288470) [09:24:43] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41947/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/932381 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:24:45] (03CR) 10Jaime Nuche: [WIP] deployment: Use rsync::quickdatacopy, enable encryption (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [09:25:11] (03PS1) 10Elukey: Revert "Revert "role::ml_cache::storage: add fake TLS keystore password for PKI"" [labs/private] - 10https://gerrit.wikimedia.org/r/932275 [09:25:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] Revert "Revert "role::ml_cache::storage: add fake TLS keystore password for PKI"" [labs/private] - 10https://gerrit.wikimedia.org/r/932275 (owner: 10Elukey) [09:26:15] !log uploaded openjdk-8 8u372-ga-1~deb10u1 to component/jdk8 (forward port of Java 8 for Buster) [09:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41948/console" [puppet] - 10https://gerrit.wikimedia.org/r/932378 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:27:50] (03PS6) 10Arturo Borrero Gonzalez: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) [09:31:14] (03PS1) 10Elukey: ml-services: update docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/932382 [09:31:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932378 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:32:16] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra: allow to set the keystore password for 4.x [puppet] - 10https://gerrit.wikimedia.org/r/932381 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:32:18] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/932361 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [09:32:23] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::cassandra: add hiera option for the TLS keystore password [puppet] - 10https://gerrit.wikimedia.org/r/932378 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:34:07] (03CR) 10Klausman: [C: 03+1] ml-services: update docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/932382 (owner: 10Elukey) [09:34:29] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932382 (owner: 10Elukey) [09:35:29] (03CR) 10Elukey: [C: 03+2] ml-services: update docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/932382 (owner: 10Elukey) [09:35:53] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:25] RECOVERY - SSH on an-worker1110 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:10] (03CR) 10MVernon: [C: 03+2] hiera: set ms-be2068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932197 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [09:37:27] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:34] (03PS7) 10Arturo Borrero Gonzalez: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) [09:37:47] (03PS9) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [09:40:07] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/931880/41952/" [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [09:40:36] (ProbeDown) firing: (2) Service releases1002:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:43:05] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:43:31] PROBLEM - Check systemd state on ms-be2068 is CRITICAL: 
CRITICAL - degraded: The following units failed: swift-container-sharder.service,swift-object-reconstructor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:33] (03PS1) 10Majavah: openldap: remove default value [puppet] - 10https://gerrit.wikimedia.org/r/932383 [09:44:36] (03CR) 10Arturo Borrero Gonzalez: openldap: remove default value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932383 (owner: 10Majavah) [09:45:17] (03PS8) 10Arturo Borrero Gonzalez: openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) [09:45:19] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41954/console" [puppet] - 10https://gerrit.wikimedia.org/r/932383 (owner: 10Majavah) [09:45:58] (03CR) 10Majavah: [V: 03+1] openldap: remove default value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932383 (owner: 10Majavah) [09:46:14] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/931880/41952/" [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [09:48:34] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me, but I haven't tested it will work as is, wmfbackups package config may need tuning later (but won't create any problem)" [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [09:48:52] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: pdns: add backup for the database [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [09:49:52] (03PS4) 10Jelto: miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 
(https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [09:51:35] (03CR) 10Jelto: [C: 03+1] "thanks! lgtm once we have the tendril image build and ready" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [09:52:18] (03CR) 10Jforrester: [C: 03+1] deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [09:53:19] RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:53:27] RECOVERY - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is OK: TCP OK - 0.000 second response time on 10.64.130.9 port 9042 https://phabricator.wikimedia.org/T93886 [09:53:37] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:53:57] RECOVERY - Check systemd state on ml-cache1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:22] (03CR) 10Jforrester: helmfile.d: Add wikifunctions stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/932357 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [09:57:25] (03PS1) 10Elukey: cassandra: allow to update the keystore password - part two [puppet] - 10https://gerrit.wikimedia.org/r/932384 (https://phabricator.wikimedia.org/T288470) [09:58:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41956/console" [puppet] - 10https://gerrit.wikimedia.org/r/932384 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:59:06] (03PS2) 10Elukey: cassandra: allow to update the keystore password - part two [puppet] - 
10https://gerrit.wikimedia.org/r/932384 (https://phabricator.wikimedia.org/T288470) [09:59:51] (03CR) 10Elukey: [C: 03+2] turnilo: add https field to webrequest_sampled_live datacube [puppet] - 10https://gerrit.wikimedia.org/r/932361 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [10:00:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41957/console" [puppet] - 10https://gerrit.wikimedia.org/r/932384 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:00:35] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra: allow to update the keystore password - part two [puppet] - 10https://gerrit.wikimedia.org/r/932384 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:02:33] (03PS8) 10Clément Goubert: deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [10:06:35] PROBLEM - cassandra-a SSL 10.64.32.186:7001 on ml-cache1002 is CRITICAL: SSL CRITICAL - failed to verify ml-cache1002-a against ml-cache1002.eqiad.wmnet:Certificate ml-cache1002.eqiad.wmnet valid until 2023-07-21 09:06:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:07:05] lol [10:07:10] jbond: --^ [10:08:25] lol [10:08:37] PROBLEM - cassandra-a SSL 10.64.134.8:7001 on ml-cache1003 is CRITICAL: SSL CRITICAL - failed to verify ml-cache1003-a against ml-cache1003.eqiad.wmnet:Certificate ml-cache1003.eqiad.wmnet valid until 2023-07-21 08:48:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:08:46] jbond: anyway, two cassandra clusters running pki! 
\o/ [10:08:54] nice work [10:09:02] \o/ [10:09:04] thanks as always for the support [10:09:45] no problem, always happy to help [10:10:19] (03PS9) 10Clément Goubert: deployment: Use rsync::quickdatacopy, enable encryption [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [10:10:27] PROBLEM - cassandra-a SSL 10.192.0.222:7001 on ml-cache2001 is CRITICAL: SSL CRITICAL - failed to verify ml-cache2001-a against ml-cache2001.codfw.wmnet:Certificate ml-cache2001.codfw.wmnet valid until 2023-07-21 08:45:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:11:23] PROBLEM - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is CRITICAL: SSL CRITICAL - failed to verify ml-cache2002-a against ml-cache2002.codfw.wmnet:Certificate ml-cache2002.codfw.wmnet valid until 2023-07-21 08:52:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:11:26] (03CR) 10Hashar: [C: 03+2] Enable parsoid support for Kartographer on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932376 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [10:12:19] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:35] PROBLEM - cassandra-a SSL 10.192.32.72:7001 on ml-cache2003 is CRITICAL: SSL CRITICAL - failed to verify ml-cache2003-a against ml-cache2003.codfw.wmnet:Certificate ml-cache2003.codfw.wmnet valid until 2023-07-21 09:02:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:12:36] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41958/console" [puppet] - 10https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: 10Legoktm) [10:12:54] !log 
installing vim security updates [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:17] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:20:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1110.eqiad.wmnet [10:27:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1110.eqiad.wmnet [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [10:38:00] (03PS10) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [10:38:31] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) This issue is exacerbated by the fact that integration of thumbor to private swift containers is completely broken and it can't save any thumbnail it makes and has to 
re-thumbnail it ever... [10:51:09] (03CR) 10JMeybohm: [C: 03+1] deployment_server: Add stanzas for wikifunctions k8s [puppet] - 10https://gerrit.wikimedia.org/r/932358 (https://phabricator.wikimedia.org/T340041) (owner: 10Alexandros Kosiaris) [10:53:19] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 3 (cloudservices1004, ...), Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:53:30] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) ` swift post wikipedia-office-local-public --read-acl "mw:thumbor,mw:thumbor-private,mw:media,.r:*" --write-acl "mw:thumbor,mw:thumbor-private,mw:media" ` Should fix the officewiki one,... [10:53:45] (03CR) 10Vgutierrez: Create cookbook to upgrade Apache Traffic Server (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [10:56:28] (03PS1) 10Btullis: Enable the networkpolicy for datahub batch jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932387 (https://phabricator.wikimedia.org/T329514) [10:58:55] (03CR) 10Btullis: [C: 03+2] Enable the networkpolicy for datahub batch jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932387 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:59:46] (03Merged) 10jenkins-bot: Enable the networkpolicy for datahub batch jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932387 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:02:04] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:03:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openldap: remove default value [puppet] - 10https://gerrit.wikimedia.org/r/932383 (owner: 10Majavah) [11:04:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openldap: remove default value (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/932383 (owner: 10Majavah) [11:06:35] (03PS1) 10Jcrespo: openstack: pdns: Change backup user for dump and make statistics configurable [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) [11:06:59] (03CR) 10CI reject: [V: 04-1] openstack: pdns: Change backup user for dump and make statistics configurable [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [11:07:56] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) Yeah, if it's `thumbor-private` trying to write to that container, we will not go to space today. I think, though, that the write ACL needs updating on the thumb container, though,... [11:08:33] (03PS2) 10Jcrespo: openstack: pdns: Change backup user for dump and make statistics configurable [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) [11:10:54] (03CR) 10Arturo Borrero Gonzalez: openstack: pdns: Change backup user for dump and make statistics configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [11:11:30] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) > I don't think thumbor needs rw to the image container rather than the thumb one mw also needs to write to it for upload What about? ` swift post wikipedia-office-local-public --read-... 
[11:11:44] (03PS3) 10Jcrespo: openstack: pdns: Change backup user for dump and make statistics configurable [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) [11:12:22] (03PS4) 10Jcrespo: openstack: pdns: Change backup user for dump and make statistics configurable [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) [11:12:27] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:13:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:16:19] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) >>! In T338765#8958238, @Ladsgroup wrote: >> I don't think thumbor needs rw to the image container rather than the thumb one > mw also needs to write to it for upload Sorry, I don'... [11:19:54] (03PS1) 10Slyngshede: D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 [11:22:02] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41960/console" [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [11:24:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41961/console" [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [11:24:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:27:16] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) Can and should, read and write in private wikis must go through mw as that's the only part of infra that has the knowledge if the requesting party is actually authorized to read or writ... 
[11:28:23] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:29:54] (03CR) 10Jcrespo: openstack: pdns: Change backup user for dump and make statistics configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [11:32:37] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) If your question is that if that's already the case or not, it is. I'm just saying why it should stay as is [11:34:59] (03CR) 10Jbond: [C: 03+1] D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [11:35:26] (03CR) 10Arturo Borrero Gonzalez: openstack: pdns: Change backup user for dump and make statistics configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [11:36:06] (03CR) 10Jbond: [C: 03+1] D:apereo_cas::service fix group membership validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [11:36:18] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) Sorry, we may be talking past each other. I think that to make thumbs on officewiki work, we would need to add `mw:thumbor-private` to the write acl to `wikipedia-office-local-thumb... [11:44:38] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) ah, yeah. That's good. 
My main worry for that is that it might need to do some write that's not obvious, e.g. update list of thumbnails of a given image in the main container so I rathe... [11:47:06] 10SRE: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317 (10JameelKaisar) The 'elapsed_time' field in the NEL report is similar to the 'duration_ms' field in the Probenet report. They are not equal but follow a similar trend. If we ignore the first pulse to each data center (identifie... [11:49:12] (03CR) 10Jcrespo: "This is the list of permissions to add:" [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [11:49:27] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:53:42] (03PS11) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [11:54:06] (03CR) 10CI reject: [V: 04-1] Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:59:38] (03PS12) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [12:08:16] (03CR) 10Muehlenhoff: "I miss the full context on the OIDC work, but JFTR relying on memberOf is fine, it's getting updated via a slapd overlay and when I added " [puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [12:09:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:10:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:25] (03PS13) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [12:10:42] !log updating ACLs on wikipedia-office containers T340189 T338765 [12:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:47] T338765: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 [12:12:15] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) I think office wiki thumbnails are now working - e.g. https://office.wikimedia.org/w/thumb.php?f=Abbrev-bot.png&width=120 now shows me a thumb. [12:14:37] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) Let me check if thumbor can actually store them now [12:15:55] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) https://upload.wikimedia.org/wikipedia/office/e/e6/CA_KPIs_-_Q2.pdf still is world readable. It might be cached in the edges but https://upload.wikimedia.org/wikipedia/office/e/e6/CA_KP... [12:15:57] (03PS2) 10Slyngshede: D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 [12:17:00] (03CR) 10Slyngshede: "Added some comments regarding the various OIDC bits." 
[puppet] - 10https://gerrit.wikimedia.org/r/932389 (owner: 10Slyngshede) [12:18:47] (03PS3) 10Slyngshede: D:apereo_cas::service fix group membership validation [puppet] - 10https://gerrit.wikimedia.org/r/932389 [12:20:54] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) and thumbor private still can't write I think: ` root@ms-fe1009:~# swift list wikipedia-office-local-thumb --prefix 7/7b/Abbrev-bot.png root@ms-fe1009:~# ` Maybe swift needs a flush/... [12:21:36] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) No, I have to remember that `codfw` and `eqiad` are two different clusters, and do the same thing on both. Sorry, done now. [12:22:28] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) \o/ ` root@ms-fe1009:~# swift list wikipedia-office-local-thumb --prefix 7/7b/Abbrev-bot.png 7/7b/Abbrev-bot.png/120px-Abbrev-bot.png 7/7b/Abbrev-bot.png/800px-Abbrev-bot.png ` [12:24:04] (03PS1) 10Majavah: aptly: clean up code style [puppet] - 10https://gerrit.wikimedia.org/r/932395 [12:24:06] (03PS1) 10Majavah: debian: add bookworm as a valid codename [puppet] - 10https://gerrit.wikimedia.org/r/932396 [12:24:08] (03PS1) 10Majavah: P:toolforge: aptly: add a system user to own the repository [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) [12:24:10] (03PS1) 10Majavah: jwt_authorizer: support templates for validation [puppet] - 10https://gerrit.wikimedia.org/r/932398 [12:24:12] (03PS1) 10Majavah: P:toolforge: aptly: enable Aptly API [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) [12:25:43] (03PS1) 10Ayounsi: Ignore LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 [12:27:05] (03PS2) 10Ayounsi: Ignore 
LAGs from test_port_block_consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/932400 [12:27:15] (03PS2) 10Majavah: jwt_authorizer: support templates for validation [puppet] - 10https://gerrit.wikimedia.org/r/932398 [12:27:17] (03PS2) 10Majavah: P:toolforge: aptly: enable Aptly API [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) [12:27:42] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) Special:NewFiles now work just fine, should call this done? [12:28:04] (03CR) 10CI reject: [V: 04-1] P:toolforge: aptly: add a system user to own the repository [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [12:29:52] (03PS2) 10Majavah: P:toolforge: aptly: add a system user to own the repository [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) [12:29:54] (03PS3) 10Majavah: jwt_authorizer: support templates for validation [puppet] - 10https://gerrit.wikimedia.org/r/932398 [12:29:56] (03PS3) 10Majavah: P:toolforge: aptly: enable Aptly API [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) [12:29:58] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) Let me fix collab first, and then I think we can close here. 
[12:31:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41966/console" [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [12:31:37] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:32:25] (03PS14) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [12:33:18] (03PS4) 10Majavah: P:toolforge: aptly: enable Aptly API [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) [12:34:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:35:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41967/console" [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [12:36:00] (03PS5) 10Majavah: P:toolforge: aptly: enable Aptly API [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) [12:37:14] (03PS5) 10Gehel: query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 [12:37:33] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [12:39:26] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41968/console" [puppet] - 10https://gerrit.wikimedia.org/r/932399 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [12:39:39] (03PS5) 10Jcrespo: dbbackups: Make backups statistics optional [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) [12:40:17] (03CR) 10Elukey: [C: 03+2] Move drmrs Varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [12:40:41] !log move varnishkafka drmrs instances to pki [12:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:38] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Make backups statistics optional [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [12:43:11] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Quiddity) >>! In T338765#8958355, @MatthewVernon wrote: > Let me fix collab first, and then I think we can close here. Urbanecm_WMF mentioned above that he sees the same issue at "stewardwiki [an... [12:43:58] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Urbanecm_WMF) >>! In T338765#8958408, @Quiddity wrote: >>>! In T338765#8958355, @MatthewVernon wrote: >> Let me fix collab first, and then I think we can close here. > > Urbanecm_WMF mentioned ab... 
[12:44:30] fixing a small logical bug [12:45:56] (03PS15) 10Muehlenhoff: ferm: Allow passing sets to an srange or drange [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [12:47:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:48:56] (03PS1) 10Jcrespo: dbbackups: Fix small logical backup for no-stats-file case [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) [12:50:04] (03PS2) 10Jcrespo: dbbackups: Fix small logical backup for no-stats-file case [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) [12:50:51] (03PS1) 10Slyngshede: P:netbox Redirect to idp on OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/932404 [12:50:59] (03CR) 10Jcrespo: "this made puppet fail for cloud dev hosts." [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [12:52:02] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Ladsgroup) >>! In T338765#8958408, @Quiddity wrote: >>>! In T338765#8958355, @MatthewVernon wrote: >> Let me fix collab first, and then I think we can close here. > > Urbanecm_WMF mentioned above... 
[12:52:04] (03CR) 10CI reject: [V: 04-1] dbbackups: Fix small logical backup for no-stats-file case [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [12:52:39] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:52:59] (03PS1) 10Muehlenhoff: Extend access for mnz and trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/932405 [12:53:05] (03PS3) 10Jcrespo: dbbackups: Fix small logical backup for no-stats-file case [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) [12:54:21] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [12:55:39] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for mnz and trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/932405 (owner: 10Muehlenhoff) [12:56:21] (03CR) 10CI reject: [V: 04-1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [12:56:25] (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [12:56:54] (03PS3) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [12:57:41] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Fix small logical backup for no-stats-file case [puppet] - 10https://gerrit.wikimedia.org/r/932403 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [13:04:49] (03CR) 10Gehel: "PCC looks good. 
Minor changes that will be applied:" [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [13:04:51] (03CR) 10Jcrespo: [C: 03+1] "typo, but otherwise +1" [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [13:06:55] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:50] (03CR) 10Jelto: [C: 03+2] gitlab runner: Allow mariadb:* images [puppet] - 10https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352) (owner: 10Kosta Harlan) [13:13:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:18:35] (03CR) 10Muehlenhoff: P:toolforge: aptly: add a system user to own the repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932397 (https://phabricator.wikimedia.org/T340180) (owner: 10Majavah) [13:24:15] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:27:56] (03PS1) 10Andrew Bogott: wmcs-backup.py: Don't choke if a VM is deleted while we're backing up [puppet] - 10https://gerrit.wikimedia.org/r/932411 [13:29:57] !log add 200G to prometheus/k8s in eqiad [13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:55] !log update private wiki container ACLs in codfw-swift [13:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:50] (03PS3) 10Aqu: Add Airflow configuration to 
connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [13:32:14] (03PS3) 10Elukey: Move esams varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) [13:32:36] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the patch, please replace frpig1001 from hieradata/common.yaml with the new hostname too" [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [13:35:22] !log update private wiki container ACLs in eqiad-swift [13:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41970/console" [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:35:44] (03PS3) 10Jelto: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) [13:35:59] (03CR) 10Gehel: [C: 03+1] query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [13:36:05] (03PS1) 10Elukey: cassandra: add support for shorter TLS cert expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) [13:37:32] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41971/console" [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [13:37:47] (03PS1) 10Urbanecm: Section images: Placeholder should serialize to empty string [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/932280 (https://phabricator.wikimedia.org/T340170) [13:39:49] 10SRE, 
10SRE-swift-storage, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) [13:40:20] (03PS4) 10Jelto: sre: add gitlab ci alerts [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) [13:40:36] (ProbeDown) firing: (2) Service releases1002:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:58] 10SRE, 10SRE-swift-storage, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Right, I think I have fixed this on all private wikis. [13:43:08] (03PS1) 10EoghanGaffney: releases: Fix alert for releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/932414 [13:44:46] @jbond: @vgutierrez: @cwhite: @thcipriani: Hi, I would like to do a Friday deploy of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/932280 to fix a critical product issue. Can anyone approve that please? [13:45:03] (03PS4) 10Aqu: Add Airflow configuration to connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [13:46:06] urbanecm: I don't have the knowledge or the background to review that CR thoroughly [13:46:58] (03PS5) 10Aqu: Add Airflow configuration to connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [13:47:06] godog around? I have a followup question re. the prometheus exporter and frpig1001 [13:47:16] Jeff_Green: hey, sure! [13:47:38] is that the service that's polling SSL certs? 
[13:47:42] vgutierrez: It's been reviewed internally within my team (Growth), see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/932341. I basically need someone to stand by in case the deploy harms our infra for some reason per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies. [13:48:29] Jeff_Green: in this case only ping, I'm assuming we're talking about blackbox_smoke_hosts in hiera [13:48:48] urbanecm: ack, you got SREs on call in both sides of the pond ATM [13:49:00] yeah, I'm wondering if it would make more sense to use the public service hostname instead of the internal one? [13:49:26] vgutierrez: is that a sre approval for me doing the deployment? :)) [13:49:41] godog: although I don't know if the exporter would be able to route to the external side of the fr firewalls [13:50:58] (03CR) 10Jelto: sre: add gitlab ci alerts (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [13:51:03] Jeff_Green: I'm looking at the list of hostnames and we do ping frbast-eqiad.wikimedia.org already, I believe (but I'm not sure) that is to cover the public part/routing, and frpig for the internal bits, with all that said I don't feel strongly either way [13:51:59] godog: ok thinking... [13:52:15] (03CR) 10Jaime Nuche: releases: Fix alert for releases-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [13:52:28] (03CR) 10Vgutierrez: [C: 03+1] "on Monday please :)" [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:52:46] vgutierrez: what can go wrong on a friday afternoon??? 
:D :D [13:52:55] (03CR) 10EoghanGaffney: releases: Fix alert for releases-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [13:53:09] elukey: you can proceed as soon as I'm not on call [13:53:24] (03CR) 10Elukey: [C: 03+1] analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [13:53:25] godog: ok, I'm going to try to switch to the external hostnames for frpig1002 and frpig2001 [13:53:40] (03PS6) 10Aqu: Add Airflow configuration to connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [13:53:47] Jeff_Green: ok, what are those? I can quickly test if they are reachable from prometheus eqiad [13:54:24] (03CR) 10Majavah: dev env: sshd, allow for user CA based auth (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:54:29] (03PS1) 10Muehlenhoff: Remove expiry date for jm [puppet] - 10https://gerrit.wikimedia.org/r/932417 [13:54:34] (03CR) 10Urbanecm: [C: 03+2] Section images: Placeholder should serialize to empty string [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/932280 (https://phabricator.wikimedia.org/T340170) (owner: 10Urbanecm) [13:54:38] godog: payments-listener-eqiad.wikimedia.org and payments-listener-codfw.wikimedia.org, and we use CNAME payments-listener.wikimedia.org to point to the active server [13:55:10] (03CR) 10Jelto: [C: 03+1] "lgtm, adding dzahn as CC because he configured the probe" [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [13:55:12] Jeff_Green: yep looks good to me! 
works as expected [13:55:14] so using the site-specific A records allows us to continue monitoring both, but should be lower maintenance [13:55:24] godog: ok great, I'll submit a patch [13:55:29] cheers! appreciate it [13:55:31] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:55] (03CR) 10Jaime Nuche: [C: 03+1] releases: Fix alert for releases-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [13:56:32] (03PS7) 10Aqu: Add Airflow configuration to connect to DataHub [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [13:58:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (modulo ongoing comments/convo), see nitline" [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [13:58:46] (03CR) 10Muehlenhoff: [C: 03+2] Remove expiry date for jm [puppet] - 10https://gerrit.wikimedia.org/r/932417 (owner: 10Muehlenhoff) [13:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:00:01] (03PS1) 10Jgreen: Switch blackbox_smoke_hosts check from frpig.* to payments-listener-.* [puppet] - 10https://gerrit.wikimedia.org/r/932420 (https://phabricator.wikimedia.org/T319460) [14:01:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Switch blackbox_smoke_hosts check from frpig.* to payments-listener-.* [puppet] - 10https://gerrit.wikimedia.org/r/932420 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [14:01:45] godog: thank you! [14:01:53] Jeff_Green: for sure! 
thank you for following up [14:03:15] ^^ parse1002 is expected? [14:03:40] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41972/console" [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) (owner: 10Aqu) [14:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:05:52] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:07:09] (03CR) 10Jgreen: Remove frpig1001 from nsca_frack.cfg.erb in prep for decom. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/932280 (https://phabricator.wikimedia.org/T340170) (owner: 10Urbanecm) [14:10:01] vgutierrez: expired downtime, the server has hardware issues [14:10:21] (03CR) 10EoghanGaffney: [C: 03+2] releases: Fix alert for releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [14:10:23] moritzm: yeah.. 
I was staring at T339340 [14:10:23] T339340: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 [14:12:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: HW issues [14:12:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: HW issues [14:12:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=10c1e3f4-0e17-4298-b048-57e9021a0c6f) set by vgutierrez@cumin1001 for 7 d... [14:13:56] (03PS1) 10Muehlenhoff: Extend access for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/932421 [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:01] (03PS4) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [14:18:31] (03CR) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [14:18:49] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:20:03] (03PS5) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [14:20:05] (03CR) 10Muehlenhoff: [C: 03+2] Extend 
access for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/932421 (owner: 10Muehlenhoff) [14:20:33] (03Merged) 10jenkins-bot: Section images: Placeholder should serialize to empty string [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/932280 (https://phabricator.wikimedia.org/T340170) (owner: 10Urbanecm) [14:20:36] (ProbeDown) resolved: (2) Service releases1002:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:932280|Section images: Placeholder should serialize to empty string (T340170)]] [14:20:52] T340170: Section-level images: Placeholder gets saved in wiktext on rejection - https://phabricator.wikimedia.org/T340170 [14:20:55] (03PS2) 10Ssingh: admin: update membership for deployment group [puppet] - 10https://gerrit.wikimedia.org/r/931675 (https://phabricator.wikimedia.org/T339936) [14:21:22] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance [14:23:47] (03CR) 10Ssingh: [C: 03+2] admin: update membership for deployment group [puppet] - 10https://gerrit.wikimedia.org/r/931675 (https://phabricator.wikimedia.org/T339936) (owner: 10Ssingh) [14:24:20] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team, 10Patch-For-Review: Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10ssingh) [14:25:46] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team, 10Patch-For-Review: Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10ssingh) >>! 
In T339936#8951911, @taavi wrote: > `deployment` includes `deploy-service` rights, so granting both is... [14:26:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance [14:27:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:932280|Section images: Placeholder should serialize to empty string (T340170)]] (duration: 06m 56s) [14:27:48] T340170: Section-level images: Placeholder gets saved in wiktext on rejection - https://phabricator.wikimedia.org/T340170 [14:32:46] (03CR) 10Jcrespo: [C: 03+1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [14:33:15] (03CR) 10Eevans: cassandra: add support for shorter TLS cert expiry checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:36:48] (03PS2) 10Elukey: cassandra: add support for shorter TLS cert expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) [14:37:06] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:37:39] (03CR) 10Eevans: [C: 03+1] cassandra: add support for shorter TLS cert expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:37:44] (03CR) 10Elukey: cassandra: add support for shorter TLS cert expiry checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:37:58] (03CR) 10Elukey: [V: 03+1] "PCC 
SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41974/console" [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:38:47] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra: add support for shorter TLS cert expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/932413 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [14:41:03] 10ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (10BTullis) [14:41:29] 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (10BTullis) [14:42:41] 10ops-eqiad, 10Data-Platform-SRE: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (10BTullis) [14:44:12] ACKNOWLEDGEMENT - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Requested a battery replacement in T340204 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:50:18] (ProbeDown) firing: (2) Service ml-cache1002:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache1002:7001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:55] ufff [14:54:53] ah yes of course, the probe is not hitting ml-cache1002-a [14:57:42] (03CR) 10FNegri: "I'm slightly confused by the fact that the latest PCC does not show the expected change to /etc/cumin/ssh_config in cumin1001 and cuminunp" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 
10FNegri) [14:58:51] (03PS1) 10Elukey: cassandra::instance::monitoring: fix tcp alert [puppet] - 10https://gerrit.wikimedia.org/r/932424 (https://phabricator.wikimedia.org/T288470) [15:02:08] (03PS2) 10Elukey: cassandra::instance::monitoring: fix tcp alert [puppet] - 10https://gerrit.wikimedia.org/r/932424 (https://phabricator.wikimedia.org/T288470) [15:03:30] (03PS3) 10TChin: eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) [15:04:08] (03CR) 10TChin: eventstreams use kafka egress and service mesh (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin) [15:07:57] thanks for the backport urbanecm [15:08:16] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:10:05] No problem thcipriani [15:12:32] (03PS6) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:13:18] (03CR) 10CI reject: [V: 04-1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:13:31] (03PS3) 10Elukey: cassandra::instance::monitoring: fix tcp alert [puppet] - 10https://gerrit.wikimedia.org/r/932424 (https://phabricator.wikimedia.org/T288470) [15:16:00] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41978/console" [puppet] - 10https://gerrit.wikimedia.org/r/932424 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:17:33] (03PS6) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 
(https://phabricator.wikimedia.org/T337972) [15:17:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:17:58] (03CR) 10CI reject: [V: 04-1] dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:18:04] (03CR) 10Elukey: [V: 03+1 C: 03+2] cassandra::instance::monitoring: fix tcp alert [puppet] - 10https://gerrit.wikimedia.org/r/932424 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:18:33] (03CR) 10JHathaway: dev env: sshd, allow for user CA based auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:18:48] (03CR) 10David Caro: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [15:20:18] (ProbeDown) firing: (12) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:14] (03PS7) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [15:24:53] (03PS8) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [15:25:41] (03PS7) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:26:04] (03CR) 10CI reject: [V: 04-1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 
(https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:27:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:31:02] (03PS8) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:34:01] (03PS1) 10Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) [15:34:24] (03CR) 10CI reject: [V: 04-1] cassandra::instance::monitoring: move alerts to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:35:52] (03PS2) 10Elukey: cassandra::instance::monitoring: move alerts to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) [15:37:53] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41980/console" [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:37:57] (03PS9) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:39:21] (03PS10) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:41:27] (03PS11) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:42:39] (03PS12) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 
(https://phabricator.wikimedia.org/T339894) [15:42:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "o11y part LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:44:42] (03PS1) 10Btullis: Fix the networkpolicy selector for datahub maintenance jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932428 (https://phabricator.wikimedia.org/T329514) [15:45:36] (03PS13) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:45:53] 10SRE-Sprint-Week-Sustainability-March2023, 10DynamicPageList (Wikimedia), 10serviceops-radar, 10Performance Issue, and 6 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Pppery) [15:46:15] (03CR) 10Btullis: [C: 03+2] Fix the networkpolicy selector for datahub maintenance jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932428 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:47:07] (03Merged) 10jenkins-bot: Fix the networkpolicy selector for datahub maintenance jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932428 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:47:51] (03PS14) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:48:14] (03CR) 10CI reject: [V: 04-1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:48:29] (03PS15) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:48:52] (03CR) 10CI reject: [V: 04-1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 
(https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:49:38] (03PS16) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:50:18] (ProbeDown) firing: (18) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:38] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:52:38] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC looks correct https://puppet-compiler.wmflabs.org/output/932406/41984/" [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:54:15] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: Kryo memcached transcoder broken in CAS 6.3/6.4 - https://phabricator.wikimedia.org/T273867 (10Pppery) [15:54:32] (03PS17) 10Arturo Borrero Gonzalez: openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) [15:55:36] (ProbeDown) firing: (2) Service releases1002:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:20] (03CR) 10Ottomata: "LGTM, let's merge this Monday and deploy together? 
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (https://phabricator.wikimedia.org/T335024) (owner: 10TChin) [15:57:11] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/932406/41985/" [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:57:19] 10SRE, 10DNS, 10Traffic-Icebox, 10Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Pppery) [15:57:42] (03CR) 10Jcrespo: [C: 03+1] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [15:57:49] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: pdns: add grants for DB backups [puppet] - 10https://gerrit.wikimedia.org/r/932406 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [16:02:19] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:03:46] (03PS1) 10Arturo Borrero Gonzalez: pdns_server: db_backups: avoid <<< redirection [puppet] - 10https://gerrit.wikimedia.org/r/932430 (https://phabricator.wikimedia.org/T339894) [16:04:22] (03CR) 10Dzahn: "it's alerting again because the status is now back to 403 as it has been for quite some time. are people working on this service?" 
[puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [16:05:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC https://puppet-compiler.wmflabs.org/output/932430/41986/" [puppet] - 10https://gerrit.wikimedia.org/r/932430 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [16:08:12] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T340225" [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [16:10:31] (03CR) 10Jaime Nuche: [C: 03+1] releases: Fix alert for releases-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932414 (owner: 10EoghanGaffney) [16:10:36] (ProbeDown) resolved: (2) Service releases1002:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1002:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:06] (03PS1) 10RLazarus: opentelemetry-collector: Set additionalProperties: true in values schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/932431 (https://phabricator.wikimedia.org/T324117) [16:16:22] (03PS2) 10RLazarus: opentelemetry-collector: Set additionalProperties: true in values schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/932431 (https://phabricator.wikimedia.org/T324117) [16:23:11] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:33:34] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:37:39] (03CR) 10RLazarus: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/932355 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [16:40:47] (03PS1) 10Dzahn: wikistats: replace Apache 2.2 with Apache 2.4 access control syntax [puppet] - 10https://gerrit.wikimedia.org/r/932434 (https://phabricator.wikimedia.org/T338071) [16:43:01] (03PS1) 10Dzahn: contint: replace Apache 2.2 with 2.4 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) [16:47:30] (03PS2) 10Dzahn: wikistats: replace Apache 2.2 with Apache 2.4 access control syntax [puppet] - 10https://gerrit.wikimedia.org/r/932434 (https://phabricator.wikimedia.org/T338071) [16:48:21] (03CR) 10Dzahn: "https://httpd.apache.org/docs/2.4/upgrading.html" [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [16:49:04] (03PS2) 10Dzahn: contint: replace Apache 2.2 with 2.4 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) [16:51:16] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudservices2005-dev), No backups: 3 (cloudservices1004, ...), Fresh: 127 jobs Jcrespo reported on ticket - The acknowledgement expires at: 2023-06-26 12:50:51. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:51:32] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (10Eevans) >>! In T340055#8957230, @Dzahn wrote: > Could it be that the "1G RJ45/SFP converter" is the broken component? Could explain why you have light on server side but not on switch... 
[16:57:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41988/console" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:58:08] (03CR) 10Majavah: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [17:04:54] (03PS1) 10Dzahn: releases-jenkins: replace Apache 2.2 with 2.4 syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) [17:05:58] (03CR) 10Dzahn: "https://stackoverflow.com/questions/51972679/how-to-block-a-specific-user-agent-in-apache" [puppet] - 10https://gerrit.wikimedia.org/r/932439 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [17:08:37] (03CR) 10Bking: [C: 03+2] airflow: Make Data Engineering primary contact [puppet] - 10https://gerrit.wikimedia.org/r/907992 (https://phabricator.wikimedia.org/T334522) (owner: 10Bking) [17:12:12] (03PS1) 10Dzahn: apache: replace Apache 2.2 access control syntax for Jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) [17:12:32] (03PS2) 10Dzahn: contint: replace Apache 2.2 access control syntax for Jenkins proxy [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) [17:15:05] (03CR) 10Btullis: Add Airflow configuration to connect to DataHub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) (owner: 10Aqu) [17:16:56] (03PS1) 10Dzahn: webperf: replace Apache 2.2 with modern syntax for access control [puppet] - 10https://gerrit.wikimedia.org/r/932441 (https://phabricator.wikimedia.org/T258686) [17:20:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - 
https://phabricator.wikimedia.org/T258686 (Dzahn) I don't really see what we are solving by denying things when we know we still want and have to fix them in the future. Wit...
[17:25:30] (PS1) Dzahn: prometheus: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686)
[17:26:26] (PS1) Dzahn: thanos: replace Apache 2.2 with modern syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932444 (https://phabricator.wikimedia.org/T258686)
[17:28:41] (PS2) Dzahn: prometheus: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932443 (https://phabricator.wikimedia.org/T258686)
[17:34:50] (PS1) Dzahn: graphite: replace Apache 2.2 access control syntax [puppet] - https://gerrit.wikimedia.org/r/932445 (https://phabricator.wikimedia.org/T258686)
[17:42:19] (PS1) Dzahn: mediawiki: replace Apache 2.2 syntax for access control [puppet] - https://gerrit.wikimedia.org/r/932447
[18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:27:21] (PS1) Ottomata: refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236)
[18:27:58] (CR) CI reject: [V: -1] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[18:29:08] (CR) Ottomata: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41989/console" [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[18:46:01] (PS1) JHathaway: stdlib: upgrade to v8.6.1 [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972)
[18:47:34] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932459 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[18:55:00] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (JameelKaisar) For US states District of Columbia (DC), Ohio (OH) and Virginia (VA), we are getting abnorm...
[18:58:10] (PS2) Ottomata: refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236)
[18:58:37] (CR) CI reject: [V: -1] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[19:02:47] (PS3) Ottomata: refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236)
[19:03:11] (CR) CI reject: [V: -1] refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[19:04:08] (CR) Ottomata: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41990/console" [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236) (owner: Ottomata)
[19:08:19] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[19:09:17] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[19:09:19] (CR) EoghanGaffney: [C: +2] releases: Fix alert for releases-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932414 (owner: EoghanGaffney)
[19:17:56] (PS4) Ottomata: refinery::job::canary_events - use spark to launch, bump to version 0.2.17 [puppet] - https://gerrit.wikimedia.org/r/932456 (https://phabricator.wikimedia.org/T330236)
[19:21:19] (PS1) JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972)
[19:22:11] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[19:28:57] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:01] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[19:31:58] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops, WMF-NDA: reconfigure 1:1 NAT for new eqiad frmon host - https://phabricator.wikimedia.org/T340252 (Dwisehaupt)
[19:50:18] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:56:21] (CR) JHathaway: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/932431 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[19:57:44] (CR) JHathaway: "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/932355 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[20:03:33] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (JameelKaisar) After updating the mappings of 8 countries ([930293](https://gerrit.wikimedia.org/r/930293)...
[20:08:28] (CR) RLazarus: [C: +2] opentelemetry-collector: Set additionalProperties: true in values schema [deployment-charts] - https://gerrit.wikimedia.org/r/932431 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[20:09:15] (Merged) jenkins-bot: opentelemetry-collector: Set additionalProperties: true in values schema [deployment-charts] - https://gerrit.wikimedia.org/r/932431 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[20:09:47] (PS1) Jameel Kaisar: Probenet: Restore mapping for Nigeria [dns] - https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318)
[20:12:43] (PS2) Jameel Kaisar: Probenet: Restore mapping for Nigeria [dns] - https://gerrit.wikimedia.org/r/932468 (https://phabricator.wikimedia.org/T337318)
[20:13:36] (CR) RLazarus: [C: +2] opentelemetry-collector: Add helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/932355 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[20:14:24] (Merged) jenkins-bot: opentelemetry-collector: Add helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/932355 (https://phabricator.wikimedia.org/T320564) (owner: RLazarus)
[21:24:36] (CR) Eevans: [C: +1] cassandra::instance::monitoring: move alerts to prometheus [puppet] - https://gerrit.wikimedia.org/r/932427 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[21:45:42] (CR) Dzahn: [C: +2] "tested in cloud. it also would work to remove this line completely. But "Require all denied" would deny users and "Require all granted" is" [puppet] - https://gerrit.wikimedia.org/r/932434 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[22:00:50] SRE, ops-eqiad, DC-Ops, serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (wiki_willy) a:Jclark-ctr
[22:01:19] SRE, ops-eqiad, Infrastructure-Foundations, decommission-hardware, serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (wiki_willy) a:Jclark-ctr
[22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:25:24] (PS1) BryanDavis: openstack: Fix YAML syntax error in /etc/novaobserver.yaml [puppet] - https://gerrit.wikimedia.org/r/932516
[22:27:04] (CR) BryanDavis: "I noticed this while tailing logs for openstack-browser:" [puppet] - https://gerrit.wikimedia.org/r/932516 (owner: BryanDavis)
[23:29:10] (CR) Andrew Bogott: [C: +2] openstack: Fix YAML syntax error in /etc/novaobserver.yaml [puppet] - https://gerrit.wikimedia.org/r/932516 (owner: BryanDavis)
[23:50:18] (ProbeDown) firing: (6) Service ml-cache1001:7001 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown