[00:05:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:03] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:31] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:20:55] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:09] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:06:31] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [01:07:18] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [01:12:49] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:29:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [01:48:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:54:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:55:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:06:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:36:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:05:57] (03PS1) 10KartikMistry: Enable Section Translation in 9 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889656 (https://phabricator.wikimedia.org/T323825) [05:19:25] (03CR) 10Stevemunene: [C: 03+1] Add a postgresql database and user for airflow_search_platform [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [05:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:26:09] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] * kart_ updating cxserver.. 
[05:53:52] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-02-15-085109-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889483 (https://phabricator.wikimedia.org/T328310) (owner: 10KartikMistry) [05:58:56] (03Merged) 10jenkins-bot: Update cxserver to 2023-02-15-085109-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889483 (https://phabricator.wikimedia.org/T328310) (owner: 10KartikMistry) [06:00:14] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:00:38] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:05:34] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:06:31] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:11:01] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:11:56] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:15:59] !log Updated cxserver to 2023-02-15-085109-production (T328310, T110190, T116466) [06:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:05] T116466: Simple English Wikipedia issues with ContentTranslation (tracking) - https://phabricator.wikimedia.org/T116466 [06:16:06] T110190: Rename "simple" wikis to "en-simple" - https://phabricator.wikimedia.org/T110190 [06:16:06] T328310: CX doesn't load when testing backports on mwdebug servers - https://phabricator.wikimedia.org/T328310 [06:30:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:32:39] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:39:49] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:41:35] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T0700) [07:00:04] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T0700). [07:22:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:23:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:25:18] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-airflow1005.eqiad.wmnet with OS buster [07:30:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:32:05] (03PS1) 10Elukey: Replace underscores with hyphens in ml-serve's etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/889661 (https://phabricator.wikimedia.org/T324542) [07:33:08] (03PS1) 10Ryan Kemper: wdqs: no longer page on failed probe [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) [07:34:40] (03PS1) 10Elukey: role::etcd::v3::ml_etcd: use PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/889663 (https://phabricator.wikimedia.org/T324542) [07:35:55] 10SRE-OnFire, 10Discovery-Search (Current work), 10Patch-For-Review, 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10RKemper) [07:36:13] (03PS2) 10Ryan 
Kemper: wdqs: no longer page on failed probe [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) [07:36:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39643/console" [puppet] - 10https://gerrit.wikimedia.org/r/889663 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [07:39:24] !log powercycle parse1012 - CPU1 errors registered in `racadm getsel` [07:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:13] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:23] !log depool parse1012 to allow the service ops team to check it [07:41:23] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:42:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:54:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:59:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:00:04] Amir1, apergos, and jnuche: gettimeofday() says it's time for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T0800) [08:00:04] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:17] * kart_ is here [08:00:37] morning! [08:00:46] apergos: Good Morning! [08:00:49] no trainees are signed up today [08:01:01] OK. I'll go ahead with self deploy :) [08:01:06] your patch is the only one, and it looks pretty straight forward [08:01:12] self deploy ho! [08:01:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:02:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889656 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [08:02:39] (03Merged) 10jenkins-bot: Enable Section Translation in 9 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889656 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [08:03:04] !log kartik@deploy1002 Started scap: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]] [08:03:10] T323825: Enable Content and Section translation on 8 Wikipedias - https://phabricator.wikimedia.org/T323825 [08:03:10] T304865: Enable Content and Section Translation for Cantonese Wikipedia - 
https://phabricator.wikimedia.org/T304865 [08:05:02] !log kartik@deploy1002 kartik: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:08:17] (03PS1) 10Muehlenhoff: Remove access for mepps [puppet] - 10https://gerrit.wikimedia.org/r/889749 [08:10:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for mepps [puppet] - 10https://gerrit.wikimedia.org/r/889749 (owner: 10Muehlenhoff) [08:15:43] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]] (duration: 12m 38s) [08:15:48] T323825: Enable Content and Section translation on 8 Wikipedias - https://phabricator.wikimedia.org/T323825 [08:15:49] T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865 [08:16:26] I'm done with deployment, apergos [08:17:14] great, thanks a lot! 
[08:17:25] PROBLEM - puppet last run on an-presto1015 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:17:37] !log UTC morning backport and config training window done [08:17:39] PROBLEM - puppet last run on an-presto1006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:41] PROBLEM - puppet last run on an-presto1007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:19:17] PROBLEM - puppet last run on an-presto1013 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:19:27] PROBLEM - puppet last run on an-presto1011 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:19:57] (03PS3) 10Nicolas Fraison: perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [08:19:59] (03PS3) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [08:20:02] (03PS4) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [08:21:45] PROBLEM - puppet last run on an-presto1009 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:45] PROBLEM - puppet last run on an-presto1014 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:53] (03CR) 10CI reject: [V: 04-1] perf(presto): remove some configuration tuning [puppet] - 
10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [08:23:31] RECOVERY - puppet last run on an-presto1006 is OK: OK: Puppet is currently disabled (Create presto cluster for perf testing - T329525 - nfraison), not alerting. Last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:25:55] !log upgrading cassandra-dev to Java 8u362-ga-4 [08:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:24] (03PS1) 10Slyngshede: P:IDM Configure production VMs [puppet] - 10https://gerrit.wikimedia.org/r/889751 [08:30:24] (03CR) 10Slyngshede: "I wanted to try having Puppet correct, before actually creating the VMs." [puppet] - 10https://gerrit.wikimedia.org/r/889751 (owner: 10Slyngshede) [08:33:28] (03CR) 10Muehlenhoff: P:IDM Configure production VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889751 (owner: 10Slyngshede) [08:34:55] (03PS2) 10Slyngshede: P:IDM Configure production VMs [puppet] - 10https://gerrit.wikimedia.org/r/889751 [08:35:15] (03CR) 10Slyngshede: P:IDM Configure production VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889751 (owner: 10Slyngshede) [08:36:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 9584 [08:36:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 9584 [08:36:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:43:36] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Vopsbot doesn't have channel topic rights - https://phabricator.wikimedia.org/T329791 
(10Peachey88) >>! In T329791#8620308, @Legoktm wrote: > On that note, does the private SRE channel even have chanserv? I know _security doesn't. If its r... [08:45:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10Peachey88) [08:51:21] (03CR) 10Jelto: [C: 03+2] gitlab: remove dedicated restore logfile and log to syslog only [puppet] - 10https://gerrit.wikimedia.org/r/889546 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [09:00:28] (03PS1) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [09:00:49] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [09:01:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:02:05] (03PS2) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [09:03:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889751 (owner: 10Slyngshede) [09:04:39] (03CR) 10Slyngshede: [C: 03+2] P:IDM Configure production VMs [puppet] - 10https://gerrit.wikimedia.org/r/889751 (owner: 10Slyngshede) [09:07:49] !log uploaded openjdk-8 8u362-ga-4~deb10u1 to component/jdk8 for buster-wikimedia (forward port of latest Java 8 security release) [09:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:58] (03CR) 10David Caro: "LGTM, let's merge today and test it 👍" [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [09:11:02] (03CR) 10David Caro: [C: 03+1] puppet: 
improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [09:16:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:20:06] (03CR) 10Jelto: [C: 03+2] "I scheduled a manual restore and journal contains a logs now. I'll remove the unmanaged dedicated log file" [puppet] - 10https://gerrit.wikimedia.org/r/889546 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [09:20:15] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:httpb: Make PCC happy [puppet] - 10https://gerrit.wikimedia.org/r/889573 (owner: 10Clément Goubert) [09:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:22:56] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Use check_ssl_http_letsencrypt for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/889571 (owner: 10Vgutierrez) [09:23:14] (03CR) 10Vgutierrez: [C: 03+2] icinga: Use check_ssl_http_letsencrypt for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/889571 (owner: 10Vgutierrez) [09:23:53] taavi: re: PuppetCertificateAboutToExpire above, what's the recommended action in this case ? 
[09:25:13] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set logs-api in 'production' [puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:27:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [09:27:30] 10SRE-swift-storage: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 (10ops-monitoring-bot) Host rebooted by mvernon@cumin1001 with reason: restart to try and get logging going again [09:28:39] (03CR) 10Filippo Giunchedi: alertmanager: tweak default incident text/description (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [09:30:12] (03PS1) 10Nicolas Fraison: feat(presto): export splits and thread metrics [puppet] - 10https://gerrit.wikimedia.org/r/889756 (https://phabricator.wikimedia.org/T329525) [09:30:46] (03PS3) 10Filippo Giunchedi: alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) [09:31:19] godog: per T319217 I think those should be just revoked [09:31:19] T319217: decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 [09:33:47] taavi: ah yeah that makes sense, I'll do it [09:35:13] (03PS1) 10Muehlenhoff: Update Airflow alias [puppet] - 10https://gerrit.wikimedia.org/r/889757 [09:35:30] !log puppet cert clean labstore100[67] - T319217 [09:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [09:42:29] (PuppetCertificateAboutToExpire) resolved: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:42:51] (03PS1) 10Elukey: cumin: add more aliases for aqs [puppet] - 10https://gerrit.wikimedia.org/r/889758 [09:46:24] (03PS1) 10Elukey: Add istio and kserve settings for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) [09:46:40] (03PS1) 10Elukey: sre.cassandra: refactor aliases and update roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 [09:47:23] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) [09:48:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) [09:53:04] (03CR) 10Klausman: [C: 03+1] sre.cassandra: refactor aliases and update roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [09:53:18] (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd: use PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/889663 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [09:53:47] (03CR) 10Klausman: [C: 03+1] cumin: add more aliases for aqs [puppet] - 10https://gerrit.wikimedia.org/r/889758 (owner: 10Elukey) [09:53:57] vgutierrez jelto heads up I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/889567 and will subsequently issue a test page [09:54:09] ook [09:54:10] (03CR) 10Muehlenhoff: cumin: add more aliases for aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889758 (owner: 10Elukey) [09:54:25] ack [09:54:48] (03PS4) 10Ayounsi: Default L2 interfaces to MTU 9192 if not 
set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [09:54:50] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [09:55:04] (03CR) 10Btullis: [C: 03+1] sre.cassandra: refactor aliases and update roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [09:55:24] (03CR) 10Btullis: [C: 03+2] Add a postgresql database and user for airflow_search_platform [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [09:55:58] (03CR) 10Muehlenhoff: sre.cassandra: refactor aliases and update roll restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [09:56:29] (03PS2) 10Elukey: cumin: add more aliases for aqs [puppet] - 10https://gerrit.wikimedia.org/r/889758 [09:56:36] (03CR) 10Elukey: cumin: add more aliases for aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889758 (owner: 10Elukey) [09:57:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889758 (owner: 10Elukey) [09:57:48] (03Abandoned) 10Btullis: Create new airflow package for version 2.3.2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/832292 (https://phabricator.wikimedia.org/T317210) (owner: 10Btullis) [09:58:32] !log issue test page with: amtool alert add TestPage address=6.6.6.6 team=sre severity=page job=testjob --annotation=runbook=lol --annotation=description='this is a test page, please ignore' --annotation=dashboard=no [09:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:04] (TestPage) firing: - lol - no - https://alerts.wikimedia.org/?q=alertname%3DTestPage [09:59:14] (03CR) 10Ayounsi: [C: 03+1] "One suggestion but the change LGTM as 
it!" [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [09:59:23] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host idm1001.wikimedia.org [09:59:25] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [09:59:49] (03CR) 10Vgutierrez: [C: 03+1] netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:00:00] vgutierrez jelto how's looking? the notification should be more readable [10:00:01] (03CR) 10Vgutierrez: [C: 03+1] netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:00:28] godog: gotta hate splunk [10:00:49] I'm with you on that [10:01:15] the page didn't make my phone sound [10:01:40] that's interesting and slightly concerning at the same time [10:01:43] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm1001.wikimedia.org - slyngshede@cumin1001" [10:02:04] godog: pa.ge received. the text is as follows [10:02:06] no this is a test pag.e, please ignore lol Alerts Firing: Labels: - alertname = TestPa.ge - address = 6.6.6.6 - job = testjob - severity = pa.ge - team = sre Annotations: - dashboard = no - des... 
[10:02:28] woot woot, thank you [10:02:34] I'll resolve [10:02:48] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm1001.wikimedia.org - slyngshede@cumin1001" [10:02:49] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:02:49] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache idm1001.wikimedia.org on all recursors [10:02:53] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idm1001.wikimedia.org on all recursors [10:03:59] godog: ah the message was cropped, there are also some more annotations. Look at https://portal.victorops.com/ui/wikimedia/incident/3426/details [10:04:15] (03CR) 10Elukey: [C: 03+2] cumin: add more aliases for aqs [puppet] - 10https://gerrit.wikimedia.org/r/889758 (owner: 10Elukey) [10:04:39] indeed [10:05:02] (03CR) 10Elukey: [C: 03+2] Remove non-kafka logstash nodes from kafka configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/886862 (https://phabricator.wikimedia.org/T329142) (owner: 10Cwhite) [10:07:54] 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10fgiunchedi) I tested the new template today with the following (from an alert host) `... 
[10:12:33] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idm1001.wikimedia.org [10:15:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: psp: base-pod-security-policies.yaml: reformat file [puppet] - 10https://gerrit.wikimedia.org/r/870665 (owner: 10Arturo Borrero Gonzalez) [10:15:32] (03PS1) 10Superpes15: [simplewiki] Change to 'uca-ga-u-kn' category collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) [10:16:54] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes [puppet] - 10https://gerrit.wikimedia.org/r/870686 [10:17:09] (03PS3) 10Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes [puppet] - 10https://gerrit.wikimedia.org/r/870686 [10:18:04] (03CR) 10Majavah: [C: 03+1] "Coming back to this, with the list of restricted paths this doesn't seem to be that bad of an issue." [puppet] - 10https://gerrit.wikimedia.org/r/870686 (owner: 10Arturo Borrero Gonzalez) [10:18:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes [puppet] - 10https://gerrit.wikimedia.org/r/870686 (owner: 10Arturo Borrero Gonzalez) [10:19:04] (TestPage) resolved: - lol - no - https://alerts.wikimedia.org/?q=alertname%3DTestPage [10:20:04] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host idm1001.wikimedia.org with OS bullseye [10:20:09] 10SRE, 10Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bullseye [10:22:01] !log installing postgresql-11 security updates on maps* [10:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:59] PROBLEM - Router interfaces on cr2-eqsin is 
CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:23:31] (03PS1) 10Elukey: cumin: add cassandra-dev to aliases [puppet] - 10https://gerrit.wikimedia.org/r/889763 [10:24:17] (03CR) 10Elukey: sre.cassandra: refactor aliases and update roll restart (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [10:25:13] (03PS2) 10Elukey: sre.cassandra: refactor aliases and update roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 [10:26:09] (03CR) 10Muehlenhoff: [C: 03+1] cumin: add cassandra-dev to aliases [puppet] - 10https://gerrit.wikimedia.org/r/889763 (owner: 10Elukey) [10:26:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [10:27:22] (03CR) 10Elukey: [C: 03+2] cumin: add cassandra-dev to aliases [puppet] - 10https://gerrit.wikimedia.org/r/889763 (owner: 10Elukey) [10:28:36] (03CR) 10Elukey: [C: 03+2] sre.cassandra: refactor aliases and update roll restart [cookbooks] - 10https://gerrit.wikimedia.org/r/889761 (owner: 10Elukey) [10:31:03] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:httpbb: Fix httpbb_kubernetes_hourly presence [puppet] - 10https://gerrit.wikimedia.org/r/889764 (owner: 10Clément Goubert) [10:31:39] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm1001.wikimedia.org with reason: host reimage [10:34:42] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm1001.wikimedia.org with reason: host reimage [10:36:53] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [10:37:35] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [10:44:13] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/889767 (https://phabricator.wikimedia.org/T327977) (owner: 10Clément Goubert) [10:45:55] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host idm1001.wikimedia.org with OS bullseye [10:45:59] 10SRE, 10Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bullseye completed: - idm1001 (**PASS**) - Removed from... [10:48:54] (03PS1) 10Jelto: gitlab: merge gitlab-restore scripts [puppet] - 10https://gerrit.wikimedia.org/r/889768 (https://phabricator.wikimedia.org/T326315) [10:50:40] (03PS1) 10Arturo Borrero Gonzalez: bullseye-sssd/: add mysql client command line utility [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/889769 (https://phabricator.wikimedia.org/T320178) [10:51:04] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39647/console" [puppet] - 10https://gerrit.wikimedia.org/r/889768 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [10:54:27] (03CR) 10Jelto: [V: 03+1] "I tested the merged restore script on the test instance" [puppet] - 10https://gerrit.wikimedia.org/r/889768 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [10:55:28] PROBLEM - Kerberos KDC daemon on krb2001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [10:55:38] !log repool parse1012 for monitoring of possible CPU1 issues [10:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:41] wow [10:56:52] (03CR) 10Jelto: [C: 03+1] "lgtm, I'll let Daniel proceed here" [puppet] - 10https://gerrit.wikimedia.org/r/889767 (https://phabricator.wikimedia.org/T327977) (owner: 10Clément 
Goubert) [10:58:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] requestctl: add mock requestctl data to be used in cloud [labs/private] - 10https://gerrit.wikimedia.org/r/888204 (owner: 10Jbond) [11:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1100). [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1100) [11:00:12] moritzm: doing anything with krb2001 atm? [11:00:37] There's an issue with a user removal that may be related [11:01:08] (a user can't be removed because a pid uses it) [11:01:28] it may be a red herring though [11:01:50] I see from the logs that the krb nodes are bombarded from an-presto requests [11:02:05] btullis, nfraison, steve_munene o/ [11:03:35] elukey: Nicolas and myself are currently testing some settings to eliminate a bottleneck for extending the Presto cluster [11:03:39] are you seeing some issue? [11:05:06] moritzm: yeah there was an alarm earlier on for krb2001, 1001 is showing the same.. I see the kdc daemon restarted some mins ago (not sure if you have done it or not) and from the logs there are a ton of requests from an-presto nodes [11:05:33] the restart was me [11:05:49] and requests were triggered by Nicolas [11:05:53] to debug https://phabricator.wikimedia.org/T329525 [11:05:59] just to clarify things this is the expected behavior from presto to issue lots of tgs-req when queries are running on it and it is the root cause of our issues to add more nodes to the cluster. 
so this pattern will persist once we will be able to add the nodes [11:07:22] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 17 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:07:47] nfraison: np thanks for the explanation, it may not be an issue but it was the only thing that I noticed while debugging. Next time please alert this change or #sre if you make experiments so we are aware :) [11:07:57] *this channel [11:16:32] (03PS1) 10Majavah: hieradata: deployment-prep: add certificate_name [puppet] - 10https://gerrit.wikimedia.org/r/889772 [11:17:43] (03PS1) 10Elukey: ml-services: update docker images for outlink and revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) [11:20:12] (03PS1) 10Majavah: tlsproxy::localssl: fix cfssl usage [puppet] - 10https://gerrit.wikimedia.org/r/889774 [11:24:22] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [11:26:48] (03PS2) 10Majavah: tlsproxy::localssl: fix cfssl usage [puppet] - 10https://gerrit.wikimedia.org/r/889774 [11:26:50] (03PS2) 10Majavah: hieradata: deployment-prep: add certificate_name [puppet] - 10https://gerrit.wikimedia.org/r/889772 [11:28:03] (03CR) 10Btullis: Do not install spark2 on bullseye or later (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [11:28:05] (03CR) 10Btullis: [C: 03+2] Do not install spark2 on bullseye or later [puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [11:29:16] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - 
https://phabricator.wikimedia.org/T329533 (10Clement_Goubert) The above patch removes `x2` from the core databases, and removes the now unu... [11:29:51] (03CR) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [11:30:04] (03PS2) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) [11:30:09] (03CR) 10MarcoAurelio: [simplewiki] Change to 'uca-ga-u-kn' category collation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) (owner: 10Superpes15) [11:30:50] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 1 process with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:30:58] RECOVERY - Kerberos KDC daemon on krb2001 is OK: PROCS OK: 1 process with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:31:57] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:35:30] (03PS1) 10Jbond: configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 [11:38:34] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:38:53] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:39:50] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:39:57] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:40:14] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Ladsgroup) Yes, that's the way we should do it given Manuel's comment above and my basic under... [11:41:12] (03PS2) 10Jbond: configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 [11:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:47:14] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:47:24] PROBLEM - Kerberos KDC daemon on krb2001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:49:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Clement_Goubert) Thanks @Ladsgroup once the spicerack release is done I'll test the cookbook p... 
[11:49:52] (03CR) 10Clément Goubert: [C: 03+2] mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [11:51:43] (03PS1) 10JMeybohm: wikikube istio: Remote the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/889780 (https://phabricator.wikimedia.org/T326729) [11:51:45] (03PS1) 10JMeybohm: aux istio: Add istio config for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889781 (https://phabricator.wikimedia.org/T329633) [11:53:35] (03Merged) 10jenkins-bot: mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [11:54:04] ^ the KDC errors can be ignored, the monitoring needs some adapting to the tests that I'm running with nfraison [11:54:06] (03CR) 10CDanis: [C: 03+1] aux istio: Add istio config for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889781 (https://phabricator.wikimedia.org/T329633) (owner: 10JMeybohm) [11:54:09] we should be done soonish [11:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST gateways) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:55:37] (03CR) 10Jaime Nuche: [C: 03+1] jenkins: fix directory in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [11:58:08] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 1 process with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [11:59:25] (03PS2) 10Superpes15: [simplewiki] Change to 'uca-default-u-kn' category collation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) [11:59:45] KDCs are back to normal now [12:00:08] RECOVERY - Kerberos KDC daemon on krb2001 is OK: PROCS OK: 1 process with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [12:00:18] (03CR) 10Superpes15: [simplewiki] Change to 'uca-default-u-kn' category collation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) (owner: 10Superpes15) [12:00:43] (03PS3) 10Jbond: configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 [12:02:31] (03CR) 10JMeybohm: [C: 03+2] wikikube istio: Remote the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/889780 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [12:02:35] (03CR) 10JMeybohm: [C: 03+2] aux istio: Add istio config for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889781 (https://phabricator.wikimedia.org/T329633) (owner: 10JMeybohm) [12:02:49] (03CR) 10CI reject: [V: 04-1] configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [12:02:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39654/console" [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [12:03:07] (03CR) 10MarcoAurelio: [C: 03+1] "LGTM 😊 - Needs `mwscript updateCollation.php --wiki=simplewiki --previous-collation=uppercase` after deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) (owner: 10Superpes15) [12:03:27] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:03:42] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:03:53] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:03:58] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:04:04] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:04:10] 10SRE, 10Infrastructure-Foundations: KDC performance tuning for TCP requests - https://phabricator.wikimedia.org/T329831 (10MoritzMuehlenhoff) [12:04:29] 10SRE, 10Infrastructure-Foundations: KDC performance tuning for TCP requests - https://phabricator.wikimedia.org/T329831 (10MoritzMuehlenhoff) p:05Triage→03Medium a:05nfraison→03MoritzMuehlenhoff [12:04:46] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:05:12] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:05:20] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:05:31] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:05:58] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:06:16] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:06:20] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:07:02] (03PS5) 10Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) [12:08:02] (03Merged) 10jenkins-bot: wikikube istio: Remote the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/889780 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [12:08:04] (03Merged) 10jenkins-bot: aux istio: Add istio config for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889781 (https://phabricator.wikimedia.org/T329633) (owner: 10JMeybohm) [12:09:30] (03CR) 10Jbond: [C: 04-1] "thanks for the Cr but see comments for possible alternate approaches" [puppet] - 10https://gerrit.wikimedia.org/r/889632 (owner: 10Majavah) [12:11:42] (03PS4) 10Clément Goubert: sre.switchdc.services: import sre.discovery.datacenter excludes [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:16:06] (03PS3) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:16:33] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:16:51] (03PS4) 10Jbond: configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 [12:17:19] (03PS4) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:17:42] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:22:33] (03PS6) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [12:23:49] (03PS5) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:24:24] (03CR) 10Clément Goubert: [C: 03+2] 
sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:26:09] (03Merged) 10jenkins-bot: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:26:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39655/console" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:26:42] (03CR) 10Majavah: [C: 04-1] "This doesn't actually solve the problem of trying to reference the non-existent 00_defs_requestctl." [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [12:32:28] (03PS1) 10Muehlenhoff: package_builder: Add build hook for component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/889795 (https://phabricator.wikimedia.org/T329491) [12:32:42] (03PS7) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [12:33:02] (03CR) 10CI reject: [V: 04-1] package_builder: Add build hook for component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/889795 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [12:34:59] (03PS2) 10EoghanGaffney: Add puppet role to new aphlict VM [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) [12:36:59] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [12:37:24] (03PS1) 10Slyngshede: P:IDM secrets are mapped wrong. 
[labs/private] - 10https://gerrit.wikimedia.org/r/889798 [12:38:20] (03PS2) 10Muehlenhoff: package_builder: Add build hook for component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/889795 (https://phabricator.wikimedia.org/T329491) [12:38:32] (03CR) 10EoghanGaffney: [C: 03+2] Add puppet role to new aphlict VM [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:39:18] (03Merged) 10jenkins-bot: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [12:39:21] (03PS6) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:43:19] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Add build hook for component/icu67 [puppet] - 10https://gerrit.wikimedia.org/r/889795 (https://phabricator.wikimedia.org/T329491) (owner: 10Muehlenhoff) [12:43:52] eoghan: I'm going to puppet-merge your patch along [12:44:08] done [12:44:21] moritzm: Thanks! Connection dropped out for a minute. 
[12:46:16] (03PS5) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:47:04] (03PS1) 10Nicolas Fraison: feat(kerberos): rely on tcp first to query kdc [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) [12:47:16] (03PS6) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:48:36] (03PS4) 10Nicolas Fraison: perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [12:48:38] (03PS4) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [12:48:40] (03PS5) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [12:49:00] (03CR) 10CI reject: [V: 04-1] sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:49:57] (03PS7) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:50:06] 10SRE, 10SRE-OnFire, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10LSobanski) [12:51:03] 10SRE, 10Infrastructure-Foundations, 10LDAP: LDAP connections use TLSv1.0 and TLSv1.1 - https://phabricator.wikimedia.org/T329218 (10LSobanski) [12:53:10] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39659/console" [puppet] - 10https://gerrit.wikimedia.org/r/889756 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [12:55:25] (03PS8) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:55:37] (03PS7) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:57:00] (03CR) 10CI reject: [V: 04-1] sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [12:57:02] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39660/console" [puppet] - 10https://gerrit.wikimedia.org/r/889756 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:02:24] (03PS5) 10Cathal Mooney: Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) [13:02:46] (03CR) 10Muehlenhoff: feat(kerberos): rely on tcp first to query kdc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:03:54] (03CR) 10Jelto: [C: 03+2] jenkins: fix directory in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [13:05:06] (03PS8) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [13:06:20] (03CR) 10Cathal Mooney: Default L2 interfaces to MTU 9212 if not set from Netbox (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [13:09:15] (03PS9) 10Clément Goubert: 
sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [13:10:17] (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [13:10:30] (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: Remove ACTIVE_ACTIVE_SECTIONS [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) [13:12:44] (03PS9) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [13:12:56] (03PS2) 10Muehlenhoff: Update Airflow alias [puppet] - 10https://gerrit.wikimedia.org/r/889757 [13:16:56] (03PS10) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [13:17:17] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [13:17:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:08] (03CR) 10Muehlenhoff: [C: 03+2] Update Airflow alias [puppet] - 10https://gerrit.wikimedia.org/r/889757 (owner: 10Muehlenhoff) [13:18:20] (03PS11) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [13:18:37] (03CR) 10Jelto: "In comparison to executing new/other java files (https://gerrit.wikimedia.org/r/886911) the values for http proxy change very infrequent. 
" [puppet] - 10https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:22:20] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) I'm not certain - it looks like the results of `sum(process_open_fds{kubernetes_namespace="thumbor"})` aren't representative of... [13:34:54] (03PS2) 10Nicolas Fraison: feat(kerberos): add feature to rely on tcp first to query kdc [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) [13:36:06] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39667/console" [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:37:51] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39668/console" [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:38:31] (03CR) 10Nicolas Fraison: [V: 03+1] feat(kerberos): add feature to rely on tcp first to query kdc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:38:48] (03CR) 10Nicolas Fraison: [V: 03+1] feat(kerberos): add feature to rely on tcp first to query kdc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [13:39:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM in general, some small comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:40:15] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39669/console" [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [13:42:22] (03PS4) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) [13:42:33] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [13:43:21] (03PS5) 10Nicolas Fraison: perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [13:43:23] (03PS5) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [13:43:25] (03PS6) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [13:44:23] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39670/console" [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [13:45:51] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39671/console" [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [13:46:04] (03CR) 10David Caro: [C: 03+2] puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [13:50:10] (03CR) 10David 
Caro: node_pinger: use jumbo frames (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:58:36] (03CR) 10Cathal Mooney: [C: 03+2] Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [13:59:09] (03Merged) 10jenkins-bot: Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1400). [14:00:05] Superpes and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:13] hello [14:00:17] Hi :) [14:00:40] * TheresNoTime can deploy in about 10m if no one else appears [14:00:51] I can deploy [14:01:02] go for it taavi, thank you :) [14:01:24] great, thanks taavi [14:01:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) (owner: 10Superpes15) [14:02:05] (03Merged) 10jenkins-bot: [simplewiki] Change to 'uca-default-u-kn' category collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889762 (https://phabricator.wikimedia.org/T329815) (owner: 10Superpes15) [14:02:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [14:02:30] !log taavi@deploy1002 Started scap: Backport for [[gerrit:889762|[simplewiki] Change to 'uca-default-u-kn' category collation (T329815)]] [14:02:35] T329815: Change wgCategoryCollation on the Simple English Wikipedia to uca-default-u-kn - https://phabricator.wikimedia.org/T329815 [14:04:01] (03PS3) 10Sergio Gimeno: GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) [14:04:24] !log taavi@deploy1002 superpes and taavi: Backport for [[gerrit:889762|[simplewiki] Change to 'uca-default-u-kn' category collation (T329815)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:05:36] Superpes: please test your patch on mwdebug1001 [14:06:26] 10SRE, 10Infrastructure-Foundations: Migrate Kerberos clients towards TCP - https://phabricator.wikimedia.org/T329839 (10MoritzMuehlenhoff) [14:06:38] taavi: I don't think that's possible without running the updating script [14:06:45] ah, true [14:06:50] so I'll just sync and then run the script? 
[14:06:52] Uhm Shouldn't be updateCollation.php runned to see the patch working? taavi [14:06:57] (03PS1) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) [14:06:58] I think so [14:07:01] Oh lol [14:07:06] taavi: yup, sync and run updateCollation i think [14:07:07] (03CR) 10CI reject: [V: 04-1] puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:07:11] sure [14:07:24] (03CR) 10Muehlenhoff: [C: 03+1] "I've filed https://phabricator.wikimedia.org/T329839 to investigate and convert the other Kerberos clients towards TCP." [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [14:09:47] Thanks taavi Let me know if you have an estimate of how long it will take (I don't know how much stuff needs to be updated there) :) [14:10:15] Superpes: the sync to the cluster will finish in a few minutes. I have no clue on how long the updateCollation script will take, at least yet [14:11:23] (03PS2) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) [14:12:07] taavi: I suggest it might run for few hours, simplewiki is not small. suggesting running in a tmux/screen. 
[14:12:14] taavi Yes, of course when you know an an estimate, just to let them know in case it takes a long time :D [14:13:08] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:889762|[simplewiki] Change to 'uca-default-u-kn' category collation (T329815)]] (duration: 10m 38s) [14:13:12] T329815: Change wgCategoryCollation on the Simple English Wikipedia to uca-default-u-kn - https://phabricator.wikimedia.org/T329815 [14:13:35] !log taavi@mwmaint1002:~$ mwscript updateCollation.php --wiki=simplewiki --previous-collation=uppercase | tee T329815.log # T329815 [14:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:53] RECOVERY - Check systemd state on cp4046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:51] Superpes: looks like an hour might be a reasonable estimate [14:14:56] sergi0: yours is up next [14:14:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [14:15:09] (03CR) 10David Caro: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:15:36] Uh ok so no problem :D Thanks for the support @taavi :) Ping me when it ends! [14:15:38] (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [14:16:01] !log taavi@deploy1002 Started scap: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]] [14:16:04] taavi: sergi0: Is it intentional that the patch disables link recommendation on cswiki? 
[14:16:04] (03PS3) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) [14:16:04] T304550: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 [14:16:19] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:16:38] urbanecm: absolutely not, let me check [14:17:21] * urbanecm sees it's just reordering now [14:17:23] cswiki was reordered since I modified csbwiki, but I can see 'cswiki' => true in line 24732 [14:17:50] !log taavi@deploy1002 taavi and sgimeno: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:17:53] urbanecm: better double check than sorry. ty! [14:18:02] sorry for the false alarm then. thanks for double checking. [14:18:17] sergi0: please test your patch on mwdebug1001 [14:19:02] taavi: it's a noop, we'll wait for systemd to trigger a maintenace script as part of the checking process. 
[14:19:18] ok, I'll just sync then [14:22:02] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889805 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:24:21] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:24] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]] (duration: 09m 23s) [14:25:28] T304550: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 [14:25:29] all done! [14:26:03] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: spicerack dnsdisc.Discovery attempts to query depooled/disabled dns auth servers - https://phabricator.wikimedia.org/T329773 (10Volans) I've amended the above patch to simplify it a bit more given there is no more the need of passi... [14:26:14] Superpes: no, this is going to be much faster. it's about 40% done [14:26:14] (03CR) 10Volans: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [14:26:24] taavi: great, thank you for the assistance. 
[14:26:34] Oh :O [14:27:37] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:21] 10SRE, 10Analytics-Radar, 10Data-Engineering-Icebox, 10Traffic: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015 (10LSobanski) [14:41:57] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] feat(kerberos): add feature to rely on tcp first to query kdc [puppet] - 10https://gerrit.wikimedia.org/r/889800 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [14:42:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.035e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:43:48] Superpes: the script finished [14:44:14] Uh wow! 
Thanks for your time and help taavi :D [14:45:44] (03PS6) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) [14:46:32] (03PS5) 10Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) [14:46:53] (03CR) 10Jbond: configmaster: Add a switch to enable the nda subdirectory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [14:47:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39676/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:51:19] (03CR) 10Majavah: [C: 04-1] configmaster: Add a switch to enable the nda subdirectory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [15:00:11] (03PS5) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 [15:02:05] (03PS1) 10Jcrespo: swift: Add dummy mediabackup passwords on the same keys as production [labs/private] - 10https://gerrit.wikimedia.org/r/889806 (https://phabricator.wikimedia.org/T269108) [15:02:54] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Enable link recommendation for 6th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888664 (https://phabricator.wikimedia.org/T304550) (owner: 10Sergio Gimeno) [15:03:58] (03PS5) 10Jbond: configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 [15:04:09] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] swift: Add dummy mediabackup passwords on the same keys as production [labs/private] - 10https://gerrit.wikimedia.org/r/889806 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [15:05:34] (03CR) 10Jbond: [C: 
03+2] tlsproxy::localssl: fix cfssl usage [puppet] - 10https://gerrit.wikimedia.org/r/889774 (owner: 10Majavah) [15:05:50] (03CR) 10Majavah: [C: 03+1] configmaster: Add a switch to enable the nda subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [15:06:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39677/console" [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [15:06:40] (03CR) 10Jbond: "thanks upated" [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [15:06:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889568 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [15:07:00] (03CR) 10Jbond: [C: 03+2] hieradata: deployment-prep: add certificate_name [puppet] - 10https://gerrit.wikimedia.org/r/889772 (owner: 10Majavah) [15:07:25] taavi: ^^ merged fyi [15:07:52] thanks! [15:09:05] (03PS7) 10Clément Goubert: P:spicerack: Add discovery/authdns.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/889568 (https://phabricator.wikimedia.org/T329773) [15:09:40] (03PS1) 10Nicolas Fraison: fix(presto): fix typo from node.enviroment to node.environment [puppet] - 10https://gerrit.wikimedia.org/r/889807 [15:10:07] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/773298/39678/ms-fe1009.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [15:12:27] (03PS1) 10JMeybohm: kubernetes: Continue to use the cergen cert for service-account signing [puppet] - 10https://gerrit.wikimedia.org/r/889808 (https://phabricator.wikimedia.org/T329826) [15:13:48] (03CR) 10Clément Goubert: [C: 03+2] P:spicerack: Add discovery/authdns.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/889568 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [15:15:07] (03PS8) 10Clément 
Goubert: spicerack: get authdns servers from config file [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) [15:15:43] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Aklapper) @SCherukuwada: Hi, any news on this to share? Thanks :) [15:15:49] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/889513 [15:17:00] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, this will make dnsdisc and service way less brittle. Thanks for the rewrite!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [15:17:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39679/console" [puppet] - 10https://gerrit.wikimedia.org/r/889808 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:20:14] jouncebot: nowandnext [15:20:14] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [15:20:15] In 1 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1700) [15:20:59] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:02] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/889513 (owner: 10PipelineBot) [15:21:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/889776 (owner: 10Jbond) [15:21:10] (03CR) 10MVernon: [C: 03+1] swift: Create a new read-only role on mw account for backup taking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773298 
(https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [15:21:43] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:22:40] jouncebot: nowandnext [15:22:40] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [15:22:40] In 1 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1700) [15:22:46] awesome [15:24:10] (03PS2) 10Ladsgroup: Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) [15:26:19] (03Merged) 10jenkins-bot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/889513 (owner: 10PipelineBot) [15:26:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:30:31] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:32:06] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:34:01] (03CR) 10Krinkle: [C: 03+1] Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:35:29] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:35:30] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:36:44] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:39:25] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply 
[15:39:33] !log PDU maintenance in rack A1 [15:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:34] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:43:17] (03PS7) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [15:43:42] (03PS4) 10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 [15:45:49] (03PS7) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) [15:46:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39682/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:47:32] (03CR) 10Ssingh: [V: 03+1] "revised the patch to say *and* instead of *or*. PCC confirms NOOP on all hosts using the dnsrecursor module." 
[puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:48:09] (03PS15) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [15:52:52] (03PS3) 10Ladsgroup: Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) [15:53:33] (03CR) 10Ladsgroup: "I didn't find anything touching EventLogging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:53:39] (03CR) 10Ladsgroup: [C: 03+2] Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:54:19] (03Merged) 10jenkins-bot: Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST gateways) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:55] 10SRE, 10Infrastructure-Foundations, 10Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (10Jgreen) 05Open→03Declined Neon! Ancient task, probably no longer relevant. 
[16:02:59] (03PS2) 10Ssingh: P:dns::recursor: set webserver port to 9199 [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) [16:04:10] !log ladsgroup@deploy1002 Synchronized wmf-config/ext-EventLogging.php: Move EventLogging settings from IS.php to ext-EventLogging.php, part I (T308932) (duration: 07m 05s) [16:04:15] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [16:08:29] (03PS3) 10Ssingh: P:dns::recursor: set webserver port to 9199 [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) [16:08:51] (03CR) 10CI reject: [V: 04-1] P:dns::recursor: set webserver port to 9199 [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:11:07] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:11:19] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:11:33] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move EventLogging settings from IS.php to ext-EventLogging.php, part II (T308932) (duration: 06m 48s) [16:11:36] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [16:14:56] (03PS1) 10Muehlenhoff: Enable command_broadcast to the new puppetdb 7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/889817 (https://phabricator.wikimedia.org/T321783) [16:16:43] (03PS1) 10Krinkle: tests: Improve diffConfig by sorting keys first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889819 [16:17:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889817 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [16:18:56] (03PS4) 10Ssingh: P:dns::recursor: set webserver port to 9199 [puppet] - 
10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) [16:19:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:19:25] !log installing net-snmp security updates on Buster [16:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39685/console" [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:21:00] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move EventLogging settings from IS.php to ext-EventLogging.php, part III (T308932) (duration: 06m 54s) [16:21:04] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [16:21:36] (03CR) 10Ladsgroup: [C: 03+2] tests: Improve diffConfig by sorting keys first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889819 (owner: 10Krinkle) [16:22:18] (03Merged) 10jenkins-bot: tests: Improve diffConfig by sorting keys first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889819 (owner: 10Krinkle) [16:24:50] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:26:03] (03PS1) 10Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) [16:27:40] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39686/console" [puppet] - 10https://gerrit.wikimedia.org/r/889822 
(https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [16:28:04] (03CR) 10CI reject: [V: 04-1] fix(presto): create pkcs12 server file with intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [16:29:11] (03PS2) 10Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) [16:29:34] (03CR) 10Jelto: [C: 03+1] "lgtm, DNS entry for gerrit.devtools.wmcloud.org with same address exists in devtools project too." [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [16:30:27] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39687/console" [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [16:30:38] (03CR) 10Nicolas Fraison: "While I understand the issue from the puppet code I don't really understand why it is happening now." [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [16:31:19] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [16:32:38] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/889288/39688/" [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:33:05] (03CR) 10Ssingh: [V: 03+1] "NOOP on all existing hosts, which is expected anyway, since it's the profile." 
[puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:33:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:34:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:31] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::recursor: set webserver port to 9199 [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:36:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:56] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39689/console" [puppet] - 10https://gerrit.wikimedia.org/r/889580 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [16:42:54] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] chore(presto): remove useless gc tag PrintGCApplicationConcurrentTime [puppet] - 10https://gerrit.wikimedia.org/r/889580 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [16:44:51] (03CR) 10BCornwall: [C: 03+2] varnish: Remove upload.wm.o test from text test [puppet] - 10https://gerrit.wikimedia.org/r/886840 (https://phabricator.wikimedia.org/T262996) (owner: 10Vgutierrez) [16:48:01] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Jhancock.wm) a:03Papaul [16:51:17] (03CR) 10Jbond: [C: 03+2] "fyi feel free to just merge changes to the private repo" [labs/private] - 10https://gerrit.wikimedia.org/r/889798 (owner: 10Slyngshede) [16:51:24] (03CR) 10Jbond: [C: 03+1] P:IDM secrets are mapped wrong. 
[labs/private] - 10https://gerrit.wikimedia.org/r/889798 (owner: 10Slyngshede) [16:53:49] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:53:59] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:54:03] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:54:17] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:54:27] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:54:43] PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:55:21] !log Deployed refinery-source change to remove Github.io from Mediasites definition of referees. [16:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:44] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) [16:56:51] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Traffic, and 2 others: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [16:57:06] Alerts above are PDU maintenance in A1 [16:57:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:20] (03CR) 10Jbond: "I think this needs more investigation. the intermediated should be added to the p12 bundle via the `certfile => $certs['ca']` line. 
whic" [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [16:58:33] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] fix(presto): do not set query.max*per-node config on coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888685 (owner: 10Nicolas Fraison) [16:58:36] (03CR) 10Jbond: [C: 04-1] fix(presto): create pkcs12 server file with intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [17:00:05] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:13] (03CR) 10Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [17:04:33] RECOVERY - Host msw1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms [17:04:42] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) Apparently (not 100% sure), the new account gives me GET permissions on the public containers, but not on the deleted ones: ` ✔ root@ms-fe1009:~... 
[17:04:45] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:45] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:45] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:45] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:45] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:46] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:46] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:04:47] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [17:05:15] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [17:05:17] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms [17:05:20] (03CR) 10Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [17:05:38] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:05:39] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [17:06:12] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:06:42] (03PS3) 10Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) [17:07:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:07:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:08:34] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39690/console" [puppet] - 10https://gerrit.wikimedia.org/r/889580 
(https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [17:09:01] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:47] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:09:57] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:10:19] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:10:43] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:11:03] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39691/console" [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [17:11:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:11:47] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:13:03] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:15:11] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [17:15:17] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:17] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:17] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:17] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:17] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:17] PROBLEM - 
Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:18] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:15:21] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [17:15:29] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:15:39] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [17:15:43] (03CR) 10Ladsgroup: [C: 03+1] sre.switchdc.mediawiki: Remove ACTIVE_ACTIVE_SECTIONS [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:15:51] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [17:16:39] (03CR) 10Clément Goubert: "Holding merge until the spicerack release including the dependency commit." [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:16:45] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:21:04] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: JVM upgrades - elukey@cumin1001 [17:22:45] PROBLEM - IPMI Sensor Status on gitlab2002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:25:00] (03CR) 10Elukey: [C: 03+1] kubernetes: Continue to use the cergen cert for service-account signing [puppet] - 10https://gerrit.wikimedia.org/r/889808 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [17:25:12] !log PDU maintenance in rack A1 complete [17:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:31] !log PDU 
maintenance in rack A8 [17:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:31] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST gateways) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:30:27] PROBLEM - IPMI Sensor Status on ml-serve2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:31:49] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) @Dzahn techblog.wikimedia.org should be good. Could you confirm? 
I am looking into: diff.wikimedia.org wikimediaendowment.org one.wikimedia.org [17:32:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:33:01] jouncebot: nowandnext [17:33:01] For the next 0 hour(s) and 26 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1700) [17:33:02] In 0 hour(s) and 26 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1800) [17:33:02] In 0 hour(s) and 26 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1800) [17:36:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST gateways) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:36:50] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39692/console" [puppet] - 10https://gerrit.wikimedia.org/r/889807 (owner: 10Nicolas Fraison) [17:38:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: JVM upgrades - elukey@cumin1001 [17:41:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST gateways) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:42:19] PROBLEM - MariaDB Replica SQL: s4 on db2106 is CRITICAL: CRITICAL 
slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:42:19] PROBLEM - MariaDB read only s4 on db2106 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:42:35] PROBLEM - MariaDB Replica IO: s4 on db2106 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:42:55] PROBLEM - mysqld processes on db2106 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:42:55] (03CR) 10Jbond: fix(presto): create pkcs12 server file with intermediate certificate (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [17:43:53] (03CR) 10Volans: [C: 03+1] "Looks sane to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert) [17:44:10] what's the issue with db2106? a crash? [17:45:10] ^ Amir1 I will depool it [17:45:14] jynus: https://phabricator.wikimedia.org/T327404 probably? 
[17:45:20] oh thanks [17:45:36] yeah A8 [17:45:45] that's a full server crash [17:45:53] not a management issue [17:45:58] ok, sorry for the noise then :) [17:46:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlocalstorage10[01-03] - https://phabricator.wikimedia.org/T329863 (10RobH) [17:47:04] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db2106', diff saved to https://phabricator.wikimedia.org/P44678 and previous config saved to /var/cache/conftool/dbconfig/20230216-174704-jynus.json [17:47:07] jynus: let me know if you want me to do anything [17:47:12] (03PS8) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [17:47:19] Amir1: if not busy, please create a task [17:47:27] sure [17:47:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlocalstorage10[01-03] - https://phabricator.wikimedia.org/T329863 (10RobH) [17:47:39] and when more available investigate why, etc [17:48:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall) [17:48:29] T329864 [17:48:29] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:49:06] if everything is good, https://logstash.wikimedia.org/goto/864976614784d20ecc879e30a8b0a8cb should stop [17:49:22] Amir1: please double check on that and then we are good [17:50:06] also downtime the server for a couple of days [17:51:31] (03CR) 10Volans: "Question inline for correctness, looks good otherwise (I didn't try the formatting options)." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [17:51:41] yeah [17:52:08] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert) [17:52:45] db2146 is weird, too [17:53:10] but look up [17:53:35] RECOVERY - IPMI Sensor Status on gitlab2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:54:53] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10RobH) [17:55:13] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10RobH) [17:55:23] PROBLEM - MariaDB Replica Lag: s4 on db2106 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:55:48] Amir1: did you downtime it? 
^ [17:56:08] not yet [17:56:09] on it [17:56:33] I hope it doesn't page [17:57:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2106.codfw.wmnet with reason: DB crashed T329864 [17:57:20] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [17:57:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2106.codfw.wmnet with reason: DB crashed T329864 [17:58:15] PROBLEM - IPMI Sensor Status on parse2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:59:16] jynus: it should be because of the maintenance [17:59:45] (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-02-16-121721-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889832 [18:00:04] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1800). [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1800) [18:00:46] o/ I'll be pushing out a new build of Toolhub in today's window. 
[18:01:11] RECOVERY - IPMI Sensor Status on ml-serve2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:02:28] it seems likely given the blip and restart [18:02:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889817 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [18:03:05] RECOVERY - mysqld processes on db2106 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:03:17] (03PS2) 10BryanDavis: toolhub: Bump container version to 2023-02-16-121721-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889832 (https://phabricator.wikimedia.org/T287179) [18:04:39] ^ Amir1 I guess you restarted the process. Let's save some time tomorrow to talk about data checking? [18:04:53] RECOVERY - MariaDB Replica IO: s4 on db2106 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:04:53] RECOVERY - MariaDB Replica SQL: s4 on db2106 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:04:53] RECOVERY - MariaDB read only s4 on db2106 is OK: Version 10.4.25-MariaDB-log, Uptime 112s, read_only: True, event_scheduler: True, 2975.24 QPS, connection latency: 0.005369s, query latency: 0.000549s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [18:04:53] yup [18:04:59] oh, you are out [18:05:06] monday then [18:05:07] ah, I forgot [18:05:14] yeah, sounds good [18:05:14] not in a hurry [18:05:21] Thanks! 
[18:06:57] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:18] but give it if you can 4 or 5 days of downtimes instead, to extend it over the weekend [18:07:32] sure [18:08:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on db2106.codfw.wmnet with reason: DB crashed T329864 [18:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db2106.codfw.wmnet with reason: DB crashed T329864 [18:08:07] with memory issues, for example, it kept crashing until servicing, so we want to avoid getting paged over the weekend :-D [18:08:08] T329864: db2106 crashed - https://phabricator.wikimedia.org/T329864 [18:11:57] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:11] RECOVERY - MariaDB Replica Lag: s4 on db2106 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:16:04] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-02-16-121721-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889832 (https://phabricator.wikimedia.org/T287179) (owner: 10BryanDavis) [18:21:02] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-02-16-121721-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889832 (https://phabricator.wikimedia.org/T287179) (owner: 10BryanDavis) [18:21:43] !log 
bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:22:30] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:24:07] TheresNoTime: if you're around, can you check on the script from https://phabricator.wikimedia.org/T315510#8618585 ? i wonder how far it has gotten so far [18:24:41] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:25:53] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:26:55] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:28:18] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:29:03] RECOVERY - IPMI Sensor Status on parse2005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:30:42] MatmaRex: it says it's processed 753400 out of 2843251 rows [18:31:16] taavi: thanks! [18:32:03] (that's probably a lot closer to done than it seems, there probably are a lot fewer rows than that. when i was testing locally, the estimates didn't consider some conditions) [18:35:15] (Nonwrite HTTP requests with primary DB connections alert) firing: Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [18:37:41] !log killed webrequest oozie bundle to deploy refinery changes. 
[18:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:28] !log ebysans@deploy1002 Started deploy [analytics/refinery@0f1a930]: Regular analytics weekly train [analytics/refinery@0f1a930] [18:47:23] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms [18:51:04] (03PS1) 10Majavah: toolforge: Drop RBAC rules for deprecated resources [puppet] - 10https://gerrit.wikimedia.org/r/889836 (https://phabricator.wikimedia.org/T329869) [18:51:21] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:21] PROBLEM - ps1-a8-codfw-infeed-load-tower-B-phase-Y on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:49] PROBLEM - ps1-a8-codfw-infeed-load-tower-B-phase-X on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:49] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:57] PROBLEM - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:59] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Dzahn) Hi @Sbenchagra, thank you very much for that. Yes, I can confirm generally all 4 have HSTS headers now/meanwhile. That is great! There is some detail t... 
[18:52:19] PROBLEM - ps1-a8-codfw-infeed-load-tower-B-phase-Z on ps1-a8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:39] !log ebysans@deploy1002 Finished deploy [analytics/refinery@0f1a930]: Regular analytics weekly train [analytics/refinery@0f1a930] (duration: 07m 11s) [18:53:30] (03Abandoned) 10Majavah: P:configmaster:: add conditional for abuse_nets link [puppet] - 10https://gerrit.wikimedia.org/r/889632 (owner: 10Majavah) [18:54:10] (03PS3) 10Ryan Kemper: wdqs: no longer page on failed probe [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) [18:54:10] !log ebysans@deploy1002 Started deploy [analytics/refinery@0f1a930] (thin): Regular analytics weekly train THIN [analytics/refinery@0f1a930] [18:54:18] !log ebysans@deploy1002 Finished deploy [analytics/refinery@0f1a930] (thin): Regular analytics weekly train THIN [analytics/refinery@0f1a930] (duration: 00m 07s) [18:54:21] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [18:54:27] ebysans: Lemme know when you're done. I have a new scap release to deploy. [18:54:42] !log ebysans@deploy1002 Started deploy [analytics/refinery@0f1a930] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0f1a930] [18:55:13] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Dzahn) @BCornwall I realize "Traffic-Icebox" has been removed but consulting input from traffic would still be valuable for this one. Should we close it as reso... 
[18:56:05] !log ebysans@deploy1002 Finished deploy [analytics/refinery@0f1a930] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0f1a930] (duration: 01m 23s) [18:57:25] (03CR) 10Dzahn: [C: 03+2] "oh, sorry and thanks for the fix! this happened to me on basically every one of them that the "v" slipped in there automatically." [puppet] - 10https://gerrit.wikimedia.org/r/889767 (https://phabricator.wikimedia.org/T327977) (owner: 10Clément Goubert) [18:59:25] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10CKoerner_WMF) [19:00:04] dduvall and ^demon: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T1900). [19:01:29] (03CR) 10Dzahn: [C: 03+2] "yes, this fixed the puppet run on planet*, thanks a lot claime!" [puppet] - 10https://gerrit.wikimedia.org/r/889767 (https://phabricator.wikimedia.org/T327977) (owner: 10Clément Goubert) [19:02:14] (03CR) 10Dzahn: [C: 03+1] "yes, I just added that new DNS entry the other day in Horizon. thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [19:03:09] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) @Dzahn techblog should now have "includeSubdomains;preload" [19:03:14] (03CR) 10Dzahn: [C: 03+1] "it's just that the VM is currently shut down for other reasons, so will apply later" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [19:03:27] !log dancy@deploy1002 Installing scap version "4.36.0" for 564 hosts [19:03:54] !log dancy@deploy1002 Installation of scap version "4.36.0" completed for 564 hosts [19:09:44] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) [19:12:50] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) [19:14:38] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) @Dzahn Happy to give consultation where needed but since we don't manage any of the sites I figured we needn't be added. Truthfully, I'm not sure who... 
[19:15:09] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Aklapper) [19:15:12] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10Aklapper) [19:16:58] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) [19:17:19] 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) soundlogo.wikimedia.org is now done! [19:20:37] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889845 (https://phabricator.wikimedia.org/T325586) [19:20:39] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889845 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [19:21:26] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889845 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [19:24:48] (03CR) 10Dzahn: [C: 03+1] Add 'vro' as alias for 'fiu-vro' [puppet] - 10https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix) [19:25:15] (Nonwrite HTTP requests with primary DB connections alert) firing: (2) Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [19:27:25] (03PS1) 10BCornwall: varnish: Check upload.wm.o for analytics cookies [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) [19:28:48] (03PS2) 10Dzahn: integration: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884395 
(https://phabricator.wikimedia.org/T327972) [19:29:06] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.23 refs T325586 [19:29:10] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [19:30:45] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 5486 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:31:08] (03PS3) 10Dzahn: integration: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884395 (https://phabricator.wikimedia.org/T327972) [19:31:28] (03CR) 10Dzahn: [C: 03+2] integration: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884395 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [19:33:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/884395/39695/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/884395 (https://phabricator.wikimedia.org/T327972) (owner: 10Dzahn) [19:38:11] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) [19:39:18] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [19:39:30] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: no longer page on failed probe [puppet] - 10https://gerrit.wikimedia.org/r/889662 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [19:40:59] (03CR) 10Volans: "The python logic is ok but as the lists of services are slightly different I'll leave this to the serviceops team to check." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [19:43:28] (03PS1) 10Raymond Ndibe: puppet: replica_cnf functional test fix [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) [19:43:48] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) Hi, @Sbenchagra, thanks for doing this. I'm still unable to confirm that soundlogo has `;includeSubdomains;preload` in the header. Maybe... [19:44:32] (03PS1) 10Ryan Kemper: wdqs: don't page for wdqs-heavy or wdqs-ssl [puppet] - 10https://gerrit.wikimedia.org/r/889852 (https://phabricator.wikimedia.org/T325324) [19:45:05] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889852 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [19:45:15] (Nonwrite HTTP requests with primary DB connections alert) resolved: Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [19:45:25] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889852 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [19:47:51] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: don't page for wdqs-heavy or wdqs-ssl [puppet] - 10https://gerrit.wikimedia.org/r/889852 (https://phabricator.wikimedia.org/T325324) (owner: 10Ryan Kemper) [19:49:43] 10SRE-OnFire, 10Discovery-Search (Current work), 10Patch-For-Review, 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10Gehel) a:03RKemper [19:49:46] (03CR) 10BCornwall: "Not sure if it's appropriate to add that check in this file... let me know if it should be in its own file. 
This seems the most appropriat" [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [19:50:04] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) Hi @BCornwall, Not a forgotten "save" click. There might be a delay, since it was done about 30 minutes ago. [19:50:57] (03CR) 10BCornwall: "```" [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [19:51:20] (03CR) 10Raymond Ndibe: "test now passes on both stretch and debian 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:51:22] (03CR) 10BCornwall: [V: 03+1] varnish: Check upload.wm.o for analytics cookies [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [19:51:46] (03CR) 10Majavah: [C: 04-1] "-1 due to the missing $ in the code." 
[puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:51:52] (03CR) 10Raymond Ndibe: "** stretch and bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:55:05] (03PS2) 10Raymond Ndibe: replica_cnf_api_test: check if user with id USER_ID exists [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) [19:55:18] (03CR) 10Raymond Ndibe: replica_cnf_api_test: check if user with id USER_ID exists (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:55:44] (03CR) 10Raymond Ndibe: "Fixed you can check again if you are still around" [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [19:57:02] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10BCornwall) [19:57:59] (03CR) 10Dzahn: "nice! I looked at the puppet run and I see the error around scap, puppet run finishes so this isn't a big deal, but looks like we need to " [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [20:06:18] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) Yep, seems to have been a delay. It's active now! Thanks for doing all that, @Sbenchagra Hate to be a pest, but would you also be willi... [20:10:01] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) Great! 
No worries @BCornwall! Let me get back to you on that. Thank you [20:10:31] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) a:03Sbenchagra [20:13:01] (03PS2) 10Ryan Kemper: [WIP] wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [20:13:33] (03PS3) 10Ryan Kemper: wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [20:19:32] 10SRE, 10Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10BCornwall) 05Open→03Resolved a:03BCornwall I can confirm that the stack nowadays happily works the same with H1 and H2. A 414 is also returned using the hyper module as Valentin did above. [20:21:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [20:28:23] 10SRE, 10SRE Observability, 10Traffic: Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (10BCornwall) 05Open→03Invalid @Vgutierrez, @BBlack, @ssingh This task seems quite too old to really act upon since the context is likely lost (not to mention that ats-tls isn't used any more). I d... 
[20:43:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Jhancock.wm) a:03Papaul [20:48:19] PROBLEM - Disk space on dumpsdata1001 is CRITICAL: DISK CRITICAL - free space: /data 894014 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops [21:00:04] brennen and TheresNoTime: OwO what's this, a deployment window?? UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T2100). nyaa~ [21:00:04] danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] o/ [21:00:29] I can deploy :) [21:01:05] (03PS3) 10Samtar: Remove Research Incentive survey from swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252) (owner: 10DDesouza) [21:01:32] thx TheresNoTime. 
[21:02:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252) (owner: 10DDesouza) [21:02:55] (03Merged) 10jenkins-bot: Remove Research Incentive survey from swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252) (owner: 10DDesouza) [21:03:11] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865740|Remove Research Incentive survey from swwiki (T321252)]] [21:03:15] T321252: Deploy Research Incentive Survey on Swahili Wikipedia - https://phabricator.wikimedia.org/T321252 [21:04:49] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:865740|Remove Research Incentive survey from swwiki (T321252)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:05:06] danisztls: that's live on mwdebug, can you test? 
[21:05:30] TheresNoTime: lgtm [21:05:36] syncing [21:06:15] TheresNoTime: thank you [21:09:19] !log Added new field referer_data to wmf.webrequest table using the alter table statement [21:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:24] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865740|Remove Research Incentive survey from swwiki (T321252)]] (duration: 08m 13s) [21:11:29] T321252: Deploy Research Incentive Survey on Swahili Wikipedia - https://phabricator.wikimedia.org/T321252 [21:11:38] danisztls: that's live :) [21:15:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Jhancock.wm) a:05Jhancock.wm→03Papaul [21:24:05] !log close UTC late backport window [21:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:29] (03Abandoned) 10DDesouza: Deploy Research Incentive survey on yowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865744 (https://phabricator.wikimedia.org/T321249) (owner: 10DDesouza) [21:31:45] (03CR) 10Volans: "Great, this looks pretty good. I've left a couple of comments and a question inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [21:32:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:34:29] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [21:43:16] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39696/console" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [21:43:36] jouncebot: nowandnext [21:43:36] For the next 0 hour(s) and 16 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T2100) [21:43:36] In 9 hour(s) and 16 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230217T0700) [21:43:55] grabbing the conch to roll out an apache config change [21:56:30] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Remove aliases 'minnan' and 'zh-cfr' [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [22:04:22] (03CR) 10BryanDavis: "Is this a better fix for T254636 than I0b9322632b2ac511661a46373431927708089d7b?" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/889769 (https://phabricator.wikimedia.org/T320178) (owner: 10Arturo Borrero Gonzalez) [22:06:09] (03CR) 10BryanDavis: bullseye-sssd/: add mysql client command line utility (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/889769 (https://phabricator.wikimedia.org/T320178) (owner: 10Arturo Borrero Gonzalez) [22:06:47] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [22:07:31] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [22:07:32] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [22:07:33] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2013-2014,2019,2021,2024].codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [22:08:12] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [22:08:53] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [22:09:41] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [22:09:42] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [22:10:48] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [22:10:49] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [22:11:34] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:11:35] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:12:15] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:12:23] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10matmarex) 05Open→03Resolved I can reliably 
access the pages no... [22:12:23] !log rzl@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [22:12:23] !log rzl@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [22:12:43] !log rzl@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [22:12:43] !log rzl@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [22:13:04] !log rzl@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [22:13:04] !log rzl@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [22:13:14] (03CR) 10Majavah: [C: 03+1] "This seems like a reasonable idea." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: 10BryanDavis) [22:13:26] !log rzl@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [22:13:27] !log rzl@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [22:13:28] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [22:15:04] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [22:15:05] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [22:15:29] (03CR) 10Majavah: [C: 03+1] "These are small and generally useful tools to any kind of app, so I'm fine with including those in the base image." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) (owner: 10BryanDavis) [22:15:47] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [22:18:51] 10SRE, 10Observability-Alerting, 10Traffic: Move (or delete?) trafficserver restart count alert from icinga to alerts.git - https://phabricator.wikimedia.org/T327791 (10BCornwall) a:03BCornwall Forgive me if I'm off base but hasn't this already been done with T300723? 
We merged in https://gerrit.wikimedia.... [22:20:22] done! [22:21:35] (03PS1) 10BCornwall: trafficserver: Remove restart count icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/889881 (https://phabricator.wikimedia.org/T300723) [22:30:00] (03PS2) 10BCornwall: trafficserver: Remove restart count icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/889881 (https://phabricator.wikimedia.org/T300723) [22:31:41] 10SRE, 10Observability-Alerting, 10Traffic: Move (or delete?) trafficserver restart count alert from icinga to alerts.git - https://phabricator.wikimedia.org/T327791 (10BCornwall) I see, I had forgotten to remove it from puppet. I've created https://gerrit.wikimedia.org/r/889881 to address that. [22:32:48] (03CR) 10Volans: [C: 03+1] "LGTM, one question and couple of typos inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [22:37:49] 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (10BCornwall) Since 9.1.4 seems to be active on all cp hosts, should this and T325726 be closed? [22:37:54] (03CR) 10Volans: [C: 03+1] "This one didn't get merged, but at this point it might be worth waiting for the next Spicerack release with the skip_acked option." [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [22:39:41] (03CR) 10Volans: sre.hosts.reboot-single: add ability to enable host on reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [22:43:34] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BCornwall) I think that part of the effort of getting our board(s) under control is redefining a ticket's anatomy. I believe this ticket is too broad and more appropriately belongs in as OKRs or general plan... 
[22:44:43] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BCornwall) To clarify, I can understand the usage of encompassing, "epic"-style tickets; I view those as separate functions than orphaned mega-tickets. [22:46:57] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10BCornwall) Seeing as the event has passed and all subtasks are closed, is this ready to be closed? [22:49:46] 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10BCornwall) Envoy seems to be on 1.18.2 now. Can this be closed, or was there any other deployment need this ticket addresses? [22:50:52] 10SRE, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10BCornwall) a:03BCornwall [23:00:20] (03PS1) 10BCornwall: varnish: Runbook and dashboard for down exporter [alerts] - 10https://gerrit.wikimedia.org/r/889887 (https://phabricator.wikimedia.org/T187708) [23:00:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[2013-2014,2019,2021,2024].codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [23:02:10] (03PS2) 10BCornwall: varnish: Runbook and dashboard for down exporter [alerts] - 10https://gerrit.wikimedia.org/r/889887 (https://phabricator.wikimedia.org/T187708) [23:03:09] 10SRE, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10BCornwall) 05Open→03In progress [23:13:22] 10SRE, 10Traffic: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10BCornwall) 05Open→03Resolved a:03BCornwall Looks like this has been rolled out to all cp nodes. Great job! 
[23:13:31] 10SRE, 10Traffic: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10BCornwall) [23:38:42] (03PS1) 10BCornwall: Remove FLoC headers [puppet] - 10https://gerrit.wikimedia.org/r/889892 [23:40:53] (03PS2) 10BCornwall: Remove FLoC headers [puppet] - 10https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) [23:56:57] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:58:25] (03Abandoned) 10Cwhite: profile: apply ipsec monitoring where enabled with ipsec_exporter [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite)