[00:03:25] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989857 [00:38:43] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989857 (owner: TrainBranchBot) [00:59:52] !log LDAP - added myself to gerritadmin group [00:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:42] (CR) Dzahn: [C: +2] research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989965 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza) [01:02:53] (Merged) jenkins-bot: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/989965 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza) [01:09:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:09:23] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/989857 (owner: TrainBranchBot) [01:14:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:21:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:26:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:30:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:40:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:21:38] (CR) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (3 comments) [mediawiki-config] - 
https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [02:22:21] (PS16) Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [02:24:12] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:12] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:12] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:18] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:12] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:11:49] urbanecm: (when you are around) did the SecurePoll patch get backported in the end? 
[03:54:09] PROBLEM - Hadoop NodeManager on an-worker1144 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:55:13] PROBLEM - Check systemd state on an-worker1144 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:21] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [03:57:43] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:58:13] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:21] PROBLEM - Check systemd state on an-worker1092 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:29] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:04:29] RECOVERY - Check systemd state on an-worker1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:04:35] RECOVERY - Hadoop NodeManager 
on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:04:39] RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:07] RECOVERY - Hadoop NodeManager on an-worker1144 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:11:08] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [04:11:31] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [04:11:43] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [04:12:11] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [04:12:17] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [04:12:38] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [04:16:59] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:17:01] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:47] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:18:49] PROBLEM - Check systemd state on an-worker1154 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:19:33] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:20:01] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:51] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:32:51] RECOVERY - Check systemd state on an-worker1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:41] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:35:43] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:07] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:55:19] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units 
failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:23] (CR) KartikMistry: Update cxserver to 2023-12-04-083437-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry) [05:52:47] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:21] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:58:05] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:58:11] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:25] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:07:33] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:10:24] (CR) Marostegui: phabricator: use same db server regardless of DC of phab server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989537 (owner: Dzahn) [06:11:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:12:14] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:21:35] (PS1) Marostegui: db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989985 (https://phabricator.wikimedia.org/T354506) [06:21:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1168 T354506', diff saved to https://phabricator.wikimedia.org/P54660 and previous config saved to /var/cache/conftool/dbconfig/20240112-062137-marostegui.json [06:21:42] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [06:23:05] (CR) Marostegui: [C: +2] db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/989985 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui) [06:23:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1168.eqiad.wmnet with OS bookworm [06:25:16] (PS1) Marostegui: installserver: Do not reimage db1248 [puppet] - https://gerrit.wikimedia.org/r/989987 [06:31:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:32:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [06:32:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [06:32:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [06:32:37] (CR) Marostegui: [C: +2] installserver: Do not reimage db1248 [puppet] - https://gerrit.wikimedia.org/r/989987 (owner: Marostegui) [06:32:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2117 (T354336)', diff saved to https://phabricator.wikimedia.org/P54661 and previous config saved to /var/cache/conftool/dbconfig/20240112-063239-marostegui.json 
[06:32:58] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [06:34:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T354336)', diff saved to https://phabricator.wikimedia.org/P54662 and previous config saved to /var/cache/conftool/dbconfig/20240112-063456-marostegui.json [06:35:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage [06:38:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: host reimage [06:41:22] (CR) Marostegui: "This shouldn't be needed, the entry for codfw proxy is already there." [puppet] - https://gerrit.wikimedia.org/r/989536 (owner: Dzahn) [06:45:19] (ProbeDown) firing: (4) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:50:02] (PS1) Marostegui: Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989940 [06:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P54663 and previous config saved to /var/cache/conftool/dbconfig/20240112-065002-marostegui.json [06:54:14] (PS1) Andrea Denisse: pontoon: Enroll pontoon-grafana-02 [puppet] - https://gerrit.wikimedia.org/r/989989 (https://phabricator.wikimedia.org/T352665) [06:57:12] (CR) Marostegui: [C: +2] Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/989940 (owner: Marostegui) [06:59:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1168.eqiad.wmnet with OS bookworm [07:00:04] 
Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240112T0700) [07:05:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P54664 and previous config saved to /var/cache/conftool/dbconfig/20240112-070508-marostegui.json [07:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54665 and previous config saved to /var/cache/conftool/dbconfig/20240112-070807-root.json [07:09:13] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:11:17] (Abandoned) Muehlenhoff: Make cumin1002 a DB admin host [puppet] - https://gerrit.wikimedia.org/r/983169 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff) [07:20:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T354336)', diff saved to https://phabricator.wikimedia.org/P54666 and previous config saved to /var/cache/conftool/dbconfig/20240112-072015-marostegui.json [07:20:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [07:20:19] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [07:20:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [07:20:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T354336)', diff saved to https://phabricator.wikimedia.org/P54667 and previous config saved to 
/var/cache/conftool/dbconfig/20240112-072038-marostegui.json [07:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T354336)', diff saved to https://phabricator.wikimedia.org/P54668 and previous config saved to /var/cache/conftool/dbconfig/20240112-072255-marostegui.json [07:23:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54669 and previous config saved to /var/cache/conftool/dbconfig/20240112-072312-root.json [07:37:06] (PS1) Muehlenhoff: Stop using DSA keys also for Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/989993 (https://phabricator.wikimedia.org/T177371) [07:38:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P54670 and previous config saved to /var/cache/conftool/dbconfig/20240112-073802-marostegui.json [07:38:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54671 and previous config saved to /var/cache/conftool/dbconfig/20240112-073817-root.json [07:40:27] SRE, Infrastructure-Foundations, Patch-For-Review: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (MoritzMuehlenhoff) The timeline to remove DSA support in OpenSSH has now been announced: https://lists.mindrot.org/pipermail/openssh-unix-announce/2024-January/0... 
[07:40:49] (Abandoned) Muehlenhoff: Stop using DSA host keys also for cloud vps instances [puppet] - https://gerrit.wikimedia.org/r/875306 (https://phabricator.wikimedia.org/T177371) (owner: Muehlenhoff) [07:43:11] SRE, Infrastructure-Foundations: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (MoritzMuehlenhoff) @jcrespo : DB remote access from cumin1002 is now working, can you please take care of moving the DB backups from cumin1001 to cumin1002? There's no rush, but eve... [07:53:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P54672 and previous config saved to /var/cache/conftool/dbconfig/20240112-075309-marostegui.json [07:53:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54673 and previous config saved to /var/cache/conftool/dbconfig/20240112-075322-root.json [07:55:21] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240112T0800) [08:08:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T354336)', diff saved to https://phabricator.wikimedia.org/P54674 and previous config saved to /var/cache/conftool/dbconfig/20240112-080815-marostegui.json [08:08:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [08:08:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:08:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54675 and previous config saved to /var/cache/conftool/dbconfig/20240112-080827-root.json [08:08:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [08:08:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2129 (T354336)', diff saved to https://phabricator.wikimedia.org/P54676 and previous config saved to /var/cache/conftool/dbconfig/20240112-080837-marostegui.json [08:10:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T354336)', diff saved to https://phabricator.wikimedia.org/P54677 and previous config saved to /var/cache/conftool/dbconfig/20240112-081055-marostegui.json [08:13:03] (CR) Slyngshede: [C: +2] Temporarily remove RAID MD alerts. [alerts] - https://gerrit.wikimedia.org/r/989645 (owner: Slyngshede) [08:14:01] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitab-runner to 16.5 [puppet] - https://gerrit.wikimedia.org/r/990028 (https://phabricator.wikimedia.org/T354913) [08:14:53] (Merged) jenkins-bot: Temporarily remove RAID MD alerts. 
[alerts] - https://gerrit.wikimedia.org/r/989645 (owner: Slyngshede) [08:18:43] (PS1) Peter Fischer: enable page_rerender for 5th batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/990029 (https://phabricator.wikimedia.org/T351503) [08:19:21] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 3605 [08:20:01] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3605 [08:23:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54678 and previous config saved to /var/cache/conftool/dbconfig/20240112-082332-root.json [08:24:12] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [08:24:39] (CR) Jelto: [C: +2] aptrepo: upgrade gitlab-ce and gitab-runner to 16.5 [puppet] - https://gerrit.wikimedia.org/r/990028 (https://phabricator.wikimedia.org/T354913) (owner: Jelto) [08:26:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P54679 and previous config saved to /var/cache/conftool/dbconfig/20240112-082601-marostegui.json [08:26:29] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:27:09] PROBLEM - Check systemd state on an-worker1152 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:07] (ProbeDown) firing: 
Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:32:45] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:33:27] RECOVERY - Check systemd state on an-worker1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:13] (ProbeDown) firing: (5) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:07] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:38:33] (PS1) Brouberol: spark-history: align Xmx/Xms valuea with amount of requested memory [deployment-charts] - https://gerrit.wikimedia.org/r/990033 (https://phabricator.wikimedia.org/T354929) [08:38:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P54680 and previous config saved to /var/cache/conftool/dbconfig/20240112-083837-root.json [08:40:11] !log upload and finish upgrade of prometheus 2.48 on all sites - T354399 [08:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:14] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399 [08:40:56] 
(PS2) Brouberol: spark-history: align Xmx/Xms valuea with amount of requested memory [deployment-charts] - https://gerrit.wikimedia.org/r/990033 (https://phabricator.wikimedia.org/T354929) [08:41:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P54681 and previous config saved to /var/cache/conftool/dbconfig/20240112-084108-marostegui.json [08:42:04] (PS1) Ilias Sarantopoulos: WIP:ml-services: deploy falcon 7b on GPU [deployment-charts] - https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) [08:42:13] (PS2) Ilias Sarantopoulos: ml-services: deploy falcon 7b on GPU [deployment-charts] - https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) [08:42:19] (PS3) Ilias Sarantopoulos: ml-services: deploy falcon 7b on GPU [deployment-charts] - https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) [08:43:29] (PS4) Ilias Sarantopoulos: ml-services: deploy falcon 7b on GPU [deployment-charts] - https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) [08:56:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T354336)', diff saved to https://phabricator.wikimedia.org/P54682 and previous config saved to /var/cache/conftool/dbconfig/20240112-085614-marostegui.json [08:56:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [08:56:19] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:56:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [08:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T354336)', diff saved to https://phabricator.wikimedia.org/P54683 and 
previous config saved to /var/cache/conftool/dbconfig/20240112-085637-marostegui.json [08:58:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T354336)', diff saved to https://phabricator.wikimedia.org/P54684 and previous config saved to /var/cache/conftool/dbconfig/20240112-085854-marostegui.json [09:09:07] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:14:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P54685 and previous config saved to /var/cache/conftool/dbconfig/20240112-091400-marostegui.json [09:16:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:17:59] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:22:30] (CR) Ayounsi: [C: +1] "lgtm with a small suggestion." 
[homer/public] - https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: Cathal Mooney) [09:25:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:25:59] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [09:26:50] SRE, ops-eqiad, Observability-Metrics: RAM upgrade for prometheus100[56] - https://phabricator.wikimedia.org/T354684 (fgiunchedi) [09:29:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P54686 and previous config saved to /var/cache/conftool/dbconfig/20240112-092907-marostegui.json [09:42:02] SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi) Eh, they're being annoying. 2 options, either we drain the router and we do the switchover as it's impactful. Or, we move the linecard to the other router to see if the issue happens there as well or not. Let... 
[09:44:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T354336)', diff saved to https://phabricator.wikimedia.org/P54687 and previous config saved to /var/cache/conftool/dbconfig/20240112-094413-marostegui.json [09:44:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:44:18] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:44:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:44:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T354336)', diff saved to https://phabricator.wikimedia.org/P54688 and previous config saved to /var/cache/conftool/dbconfig/20240112-094451-marostegui.json [09:45:07] (03CR) 10Cathal Mooney: [C: 03+2] Add basic validation to Junos config command execution flow (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [09:46:41] (03PS10) 10Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) [09:47:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T354336)', diff saved to https://phabricator.wikimedia.org/P54689 and previous config saved to /var/cache/conftool/dbconfig/20240112-094708-marostegui.json [09:49:48] (03CR) 10Klausman: ml-services: deploy falcon 7b on GPU (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [09:49:54] (03Merged) 10jenkins-bot: Add basic validation to Junos config command execution flow [cookbooks] - 10https://gerrit.wikimedia.org/r/984642 (https://phabricator.wikimedia.org/T353825) (owner: 10Cathal Mooney) [09:59:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Improve sre.network.configure-switch-interfaces cookbook error-handling - https://phabricator.wikimedia.org/T353825 (10cmooney) 05Open→03Resolved a:03cmooney Patch merged. Hopefully this will help dc-ops spot problems when deployi... [10:02:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P54690 and previous config saved to /var/cache/conftool/dbconfig/20240112-100214-marostegui.json [10:10:28] (03CR) 10Ilias Sarantopoulos: ml-services: deploy falcon 7b on GPU (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [10:11:48] (03CR) 10Dzahn: [C: 03+1] miscweb/microsites: move monitoring of design to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [10:13:15] (03Abandoned) 10Dzahn: mariadb: add mysql grants for phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/989536 (owner: 10Dzahn) [10:13:33] 10SRE, 10Observability-Alerting: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Agreed, good enough to resolve! 
[10:15:39] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: deploy falcon 7b on GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [10:16:07] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 (owner: 10Cwhite) [10:17:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P54691 and previous config saved to /var/cache/conftool/dbconfig/20240112-101721-marostegui.json [10:20:24] (03CR) 10Filippo Giunchedi: P:puppet::client_bucket Start moving monitoring to Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:21:50] (03PS1) 10Btullis: Update the openjdk-11 images to match openjdk-8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/990036 [10:22:00] (03CR) 10Filippo Giunchedi: "See inline for SSO probes, the other check LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:24:37] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:25:21] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM overall" [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:28:18] (03CR) 10Btullis: Add base production images containing Java 8 JDK and JRE (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/989786 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [10:32:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P54692 and previous config saved to /var/cache/conftool/dbconfig/20240112-103227-marostegui.json [10:32:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [10:32:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "I'm +1 on giving this a try, though please merge starting next week in case the rule ends up overloading thanos-rule" [puppet] - 10https://gerrit.wikimedia.org/r/989458 (owner: 10Klausman) [10:32:33] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:32:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [10:32:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54693 and previous config saved to /var/cache/conftool/dbconfig/20240112-103250-marostegui.json [10:33:23] (03CR) 10Btullis: [C: 03+1] spark-history: align Xmx/Xms valuea with amount of requested memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990033 (https://phabricator.wikimedia.org/T354929) (owner: 10Brouberol) [10:33:27] (03CR) 10Filippo Giunchedi: [C: 03+1] amd_rocm Prometheus script: Handle a few new metrics [puppet] - 10https://gerrit.wikimedia.org/r/989833 (owner: 10Klausman) [10:34:32] (03CR) 10Klausman: [V: 03+2 C: 03+2] amd_rocm Prometheus script: Handle a few new metrics [puppet] - 10https://gerrit.wikimedia.org/r/989833 (owner: 10Klausman) [10:34:42] (03PS3) 10Brouberol: spark-history: align Xmx/Xms values with amount of requested memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990033 (https://phabricator.wikimedia.org/T354929) [10:35:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54694 and previous config saved 
to /var/cache/conftool/dbconfig/20240112-103508-marostegui.json [10:35:11] (03PS6) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [10:35:17] (03CR) 10Filippo Giunchedi: thanos: add bucket query tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [10:41:03] (03CR) 10Btullis: wikireplicas: update-views: always run on all hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah) [10:50:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P54695 and previous config saved to /var/cache/conftool/dbconfig/20240112-105014-marostegui.json [10:51:12] (03CR) 10Klausman: [C: 03+1] ml-services: deploy falcon 7b on GPU (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [10:53:00] (03CR) 10Klausman: profile::thanos: Remove latency histo bucket filter for istio RR (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989458 (owner: 10Klausman) [10:57:20] (03CR) 10Ilias Sarantopoulos: [C: 03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [10:58:22] (03Merged) 10jenkins-bot: ml-services: deploy falcon 7b on GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/989913 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [11:01:31] (03PS2) 10Majavah: wikireplicas: update-views: always run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) [11:04:41] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on 
namespace 'llm' for release 'main' . [11:05:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P54696 and previous config saved to /var/cache/conftool/dbconfig/20240112-110521-marostegui.json [11:06:09] (03PS1) 10Muehlenhoff: Update pwstore repo [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/990040 (https://phabricator.wikimedia.org/T353524) [11:07:00] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update pwstore repo [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/990040 (https://phabricator.wikimedia.org/T353524) (owner: 10Muehlenhoff) [11:07:16] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10MoritzMuehlenhoff) [11:08:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Move pwstore repository from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353524 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The repository has been migrated, I've updated the docs/wmf-sre-laptop and an announceme... [11:08:37] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[11:09:13] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:10:55] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:55] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54697 and previous config saved to /var/cache/conftool/dbconfig/20240112-112027-marostegui.json [11:20:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:20:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:20:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:20:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54698 and previous config saved to 
/var/cache/conftool/dbconfig/20240112-112049-marostegui.json [11:26:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54699 and previous config saved to /var/cache/conftool/dbconfig/20240112-112608-marostegui.json [11:26:13] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:37:23] (03PS15) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [11:38:45] (03PS1) 10Ilias Sarantopoulos: ml-services: increase limitranges for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990044 (https://phabricator.wikimedia.org/T354870) [11:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P54700 and previous config saved to /var/cache/conftool/dbconfig/20240112-114114-marostegui.json [11:49:59] (03CR) 10Cathal Mooney: Add automation for management router BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [11:50:34] (03CR) 10Winston Sung: [C: 03+1] Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [11:52:32] (03CR) 10Muehlenhoff: [C: 03+1] "The patch is correct, but the commit message is misleading. 
The reason we can revert this isn't because https://gerrit.wikimedia.org/r/c/o" [puppet] - 10https://gerrit.wikimedia.org/r/989877 (owner: 10Cwhite) [11:56:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P54701 and previous config saved to /var/cache/conftool/dbconfig/20240112-115621-marostegui.json [12:02:40] (03CR) 10Brouberol: [C: 03+2] spark-history: align Xmx/Xms values with amount of requested memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/990033 (https://phabricator.wikimedia.org/T354929) (owner: 10Brouberol) [12:03:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:03:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:05:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:06:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:06:09] (03CR) 10Phuedx: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [12:06:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:06:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:11:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T354336)', diff saved to https://phabricator.wikimedia.org/P54703 and previous config saved to /var/cache/conftool/dbconfig/20240112-121127-marostegui.json [12:11:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [12:11:35] T354336: Add 
columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:11:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [12:11:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54704 and previous config saved to /var/cache/conftool/dbconfig/20240112-121150-marostegui.json [12:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54706 and previous config saved to /var/cache/conftool/dbconfig/20240112-121402-marostegui.json [12:21:15] (03CR) 10Majavah: "thanks for the initial reviews!" [cookbooks] - 10https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: 10Majavah) [12:21:25] (03CR) 10FNegri: [C: 03+2] dologmsg: standardize logging format [puppet] - 10https://gerrit.wikimedia.org/r/988669 (https://phabricator.wikimedia.org/T346631) (owner: 10FNegri) [12:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P54707 and previous config saved to /var/cache/conftool/dbconfig/20240112-122909-marostegui.json [12:29:54] (03CR) 10Brouberol: [C: 03+1] "Looks good! 
Thanks for the good commit message" [puppet] - 10https://gerrit.wikimedia.org/r/989461 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [12:32:37] (03CR) 10Stevemunene: [C: 03+2] Remove puppet references for druid1004_6 [puppet] - 10https://gerrit.wikimedia.org/r/989461 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [12:33:03] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=dewiki --logwiki=metawiki 'Osip Knecht' 'Artquichotte39' [12:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:07] (03PS2) 10Slyngshede: Netfilter max connection tracking entires. [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) [12:34:53] (03CR) 10Slyngshede: Netfilter max connection tracking entires. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:37:02] (03PS1) 10Muehlenhoff: clouddumps: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990048 [12:37:55] (03CR) 10Ayounsi: [C: 03+1] Add automation for management router BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [12:43:10] (03CR) 10Cathal Mooney: Add automation for management router BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [12:44:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P54708 and previous config saved to /var/cache/conftool/dbconfig/20240112-124416-marostegui.json [12:44:21] (03PS2) 10Cathal Mooney: Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) [12:44:57] (03PS1) 
10Jcrespo: Move dbbackups from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/990050 (https://phabricator.wikimedia.org/T353526) [12:45:15] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10jcrespo) [12:46:02] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 2 others: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) 05Open→03In progress a:03jcrespo [12:46:10] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 2 others: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) [12:46:17] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 2 others: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) p:05Triage→03High [12:51:20] (03CR) 10Ayounsi: Add automation for management router BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [12:52:40] (03CR) 10Cathal Mooney: Add automation for management router BGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [12:53:25] (03PS3) 10Cathal Mooney: Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) [12:53:31] (03CR) 10FNegri: [C: 03+1] "Setting "bin-copy-environment" is recommended in the official docs [1] to "increase the security of the started process", but I don't see " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: 10David Caro) [12:54:39] (03CR) 10Ayounsi: [C: 03+1] Add automation for management router BGP 
[homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [12:59:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T354336)', diff saved to https://phabricator.wikimedia.org/P54709 and previous config saved to /var/cache/conftool/dbconfig/20240112-125921-marostegui.json [12:59:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [12:59:26] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:59:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [12:59:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T354336)', diff saved to https://phabricator.wikimedia.org/P54710 and previous config saved to /var/cache/conftool/dbconfig/20240112-125944-marostegui.json [12:59:54] (03CR) 10Cathal Mooney: [C: 03+2] Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [13:00:26] (03Merged) 10jenkins-bot: Add automation for management router BGP [homer/public] - 10https://gerrit.wikimedia.org/r/989917 (https://phabricator.wikimedia.org/T354809) (owner: 10Cathal Mooney) [13:03:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990048 (owner: 10Muehlenhoff) [13:05:44] (03CR) 10Slyngshede: Ganeti memory preassure alerting. (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:06:31] (03PS2) 10Slyngshede: Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [13:10:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
Agree the new name reads better." [puppet] - 10https://gerrit.wikimedia.org/r/989444 (owner: 10Muehlenhoff) [13:12:03] (03CR) 10Majavah: [C: 03+1] clouddumps: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990048 (owner: 10Muehlenhoff) [13:13:33] (03PS1) 10Muehlenhoff: graphite: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990053 [13:14:42] (03CR) 10CI reject: [V: 04-1] graphite: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990053 (owner: 10Muehlenhoff) [13:18:46] (03PS2) 10Muehlenhoff: graphite: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990053 [13:19:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T354336)', diff saved to https://phabricator.wikimedia.org/P54711 and previous config saved to /var/cache/conftool/dbconfig/20240112-131904-marostegui.json [13:19:13] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:28:34] (03PS1) 10Tchanders: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990056 (https://phabricator.wikimedia.org/T351430) [13:31:08] (03CR) 10Filippo Giunchedi: [C: 03+1] Netfilter max connection tracking entires. 
[alerts] - 10https://gerrit.wikimedia.org/r/989188 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:33:44] (03CR) 10Kosta Harlan: [C: 03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990056 (https://phabricator.wikimedia.org/T351430) (owner: 10Tchanders) [13:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P54712 and previous config saved to /var/cache/conftool/dbconfig/20240112-133410-marostegui.json [13:34:58] 10SRE, 10serviceops, 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (10Gehel) 05In progress→03Resolved [13:36:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990053 (owner: 10Muehlenhoff) [13:38:07] (03PS3) 10Slyngshede: Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [13:39:57] (03PS1) 10Muehlenhoff: airflow::instance: Pass web server port as an integer [puppet] - 10https://gerrit.wikimedia.org/r/990060 [13:41:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/990050 (https://phabricator.wikimedia.org/T353526) (owner: 10Jcrespo) [13:42:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990060 (owner: 10Muehlenhoff) [13:46:52] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) I will go for option 2 but I will have to do that next week since today is Friday. 
Thanks [13:49:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P54713 and previous config saved to /var/cache/conftool/dbconfig/20240112-134916-marostegui.json [13:52:02] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, team name needs fixing tho" [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:53:14] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/990053 (owner: 10Muehlenhoff) [13:54:45] (03PS3) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) [13:56:25] (03CR) 10Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:58:22] (03CR) 10CI reject: [V: 04-1] P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:59:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "This is causing a PuppetConstantChange alert on titan hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [14:04:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T354336)', diff saved to https://phabricator.wikimedia.org/P54714 and previous config saved to /var/cache/conftool/dbconfig/20240112-140423-marostegui.json [14:04:39] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:13:20] (03CR) 10Jcrespo: [C: 03+2] Move dbbackups from cumin1001 to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/990050 
(https://phabricator.wikimedia.org/T353526) (owner: 10Jcrespo) [14:18:43] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 2 others: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) Doing a test backup now. Let's wait until monday in case there is some edge case or issues we are not awar... [14:19:34] (03PS23) 10Brouberol: global_config: list IPs of hadoop master/workers and kerberos nodes [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) [14:20:55] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:31:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [14:32:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [14:36:15] (03CR) 10Klausman: [C: 03+1] ml-services: increase limitranges for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990044 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [14:39:13] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Gehel) [14:45:47] (03PS1) 10Hashar: Gerrit 3.7.6 and rebuild plugins [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) [14:50:36] (03PS9) 10Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - 
10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) [14:54:51] (03PS1) 10Herron: thanos::rule: set reload service to stopped [puppet] - 10https://gerrit.wikimedia.org/r/990126 (https://phabricator.wikimedia.org/T353691) [14:55:37] (03CR) 10Klausman: [C: 03+2] ml-services: increase limitranges for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990044 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [14:58:18] (03Merged) 10jenkins-bot: ml-services: increase limitranges for ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/990044 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [14:59:13] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:32] (03PS2) 10Herron: thanos::rule: set reload service to stopped [puppet] - 10https://gerrit.wikimedia.org/r/990126 (https://phabricator.wikimedia.org/T353691) [15:00:02] (03CR) 10Filippo Giunchedi: [C: 04-1] "I given this a little more thought and I given the following:" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [15:01:04] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1094/co" [puppet] - 10https://gerrit.wikimedia.org/r/990126 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [15:03:25] (03CR) 10Herron: [C: 03+2] pyrra: reload pyrra-filesystem and thanos-rule on cfg change (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [15:06:34] (03Abandoned) 10Urbanecm: IP Masking: Set expiryAfterDays to a year 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/973881 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [15:12:00] (03CR) 10Hashar: [C: 03+2] Gerrit 3.7.6 and rebuild plugins [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar) [15:14:00] (03CR) 10Hashar: [C: 04-2] Gerrit 3.7.6 and rebuild plugins [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar) [15:14:15] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:14:36] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:14:39] (03PS2) 10Hashar: Merge tag 'v3.7.6' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) [15:14:58] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.7.6' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar) [15:23:10] (03Merged) 10jenkins-bot: Merge tag 'v3.7.6' into wmf/stable-3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990125 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar) [15:28:40] (03CR) 10Dzahn: [C: 03+2] phabricator: use quickdatacopy for automatic home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [15:32:18] (03PS1) 10Hashar: Update Gerrit to v3.7.6 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990138 [15:33:56] (03PS1) 10Klausman: ml-services: Fix missing container line and indentation for Falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/990139 [15:35:20] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: Fix missing container line and indentation for Falcon-7b 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/990139 (owner: 10Klausman) [15:35:31] (03CR) 10Klausman: [C: 03+2] ml-services: Fix missing container line and indentation for Falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/990139 (owner: 10Klausman) [15:36:48] (03PS2) 10Hashar: Update Gerrit to v3.7.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990138 (https://phabricator.wikimedia.org/T354885) [15:36:58] (03Merged) 10jenkins-bot: ml-services: Fix missing container line and indentation for Falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/990139 (owner: 10Klausman) [15:37:41] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:46:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:46:57] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[15:50:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/990126 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [15:59:32] (03PS1) 10Clément Goubert: mw-api-int: Raise replicas to 125 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990141 [16:05:16] (03PS3) 10Hashar: Update Gerrit to v3.7.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990138 (https://phabricator.wikimedia.org/T354885) [16:06:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:06:43] (03CR) 10JMeybohm: [C: 03+1] mw-api-int: Raise replicas to 125 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990141 (owner: 10Clément Goubert) [16:11:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:16:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:17:36] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Raise replicas to 125 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/990141 (owner: 10Clément Goubert) [16:18:27] (03Merged) 10jenkins-bot: mw-api-int: Raise replicas to 125 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990141 (owner: 10Clément Goubert) [16:19:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [16:20:15] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:20:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [16:20:34] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:23:45] 10SRE, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10database-backups: Migrate dbbackups from cumin1001 to cumin1002 - https://phabricator.wikimedia.org/T353526 (10jcrespo) The custom x1 backup worked well from cumin1002. Waiting now on the regular daily backup. [16:26:03] (03PS1) 10Cwhite: logstash: kafka input: add partition_assignment_strategy option [puppet] - 10https://gerrit.wikimedia.org/r/990166 (https://phabricator.wikimedia.org/T354904) [16:26:05] (03PS1) 10Cwhite: beta-logs: set kafka partition assignment strategy to cooperative_sticky [puppet] - 10https://gerrit.wikimedia.org/r/990167 (https://phabricator.wikimedia.org/T354904) [16:26:07] (03PS1) 10Cwhite: logstash: set kafka partition assignment strategy to cooperative_sticky [puppet] - 10https://gerrit.wikimedia.org/r/990168 (https://phabricator.wikimedia.org/T354904) [16:42:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:43:35] hnowlan: if you're still around ^ [16:44:05] (03PS12) 10Brouberol: external-services: define a chart 
referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [16:44:59] I'm going to increase mw-jobrunner replica count [16:45:01] claime: yep, looking [16:45:18] sgtm [16:45:18] or at least prepare it while you look for a culprit [16:47:37] (03PS1) 10Clément Goubert: mw-jobrunner: raise replicas to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990148 [16:47:52] big jump in htmlcacheupdate it looks like [16:47:56] could be others [16:48:09] (03CR) 10Hnowlan: [C: 03+1] mw-jobrunner: raise replicas to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990148 (owner: 10Clément Goubert) [16:48:19] still somewhat unusual increase in load though [16:48:25] hnowlan: we should kill the pods with ~100% apcu frag [16:49:34] (03CR) 10Clément Goubert: [C: 03+2] mw-jobrunner: raise replicas to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990148 (owner: 10Clément Goubert) [16:50:28] (03Merged) 10jenkins-bot: mw-jobrunner: raise replicas to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/990148 (owner: 10Clément Goubert) [16:50:52] it's dropping off quite a bit already [16:51:04] still add the replicas though [16:51:21] yeah [16:51:40] !log cgoubert@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [16:51:47] something started at 15:30 that spiked saturation, then spiked again at 16:38 or so [16:51:53] !log cgoubert@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:52:15] !log cgoubert@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [16:52:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 3.963% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:52:25] !log cgoubert@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:56:42] ±70% increase in parsoidCachePrewarm from eqiad [16:56:50] that'd do it [16:58:01] oh yeah [16:59:58] still, going from ~60% to ~80% as a result of that is a bit concerning. It's an impactful job though [17:02:38] (03PS11) 10Dr0ptp4kt: webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [17:06:42] (03CR) 10Tchanders: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990056 (https://phabricator.wikimedia.org/T351430) (owner: 10Tchanders) [17:07:50] cleaned up the pods with 100% fragmentation [17:07:51] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/990056 (https://phabricator.wikimedia.org/T351430) (owner: 10Tchanders) [17:09:27] !log tchanders@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [17:10:21] !log tchanders@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [17:13:03] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) [17:14:31] hnowlan: ty <3 [17:16:50] !log tchanders@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [17:17:17] !log tchanders@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [17:18:00] (03PS1) 10Jdlrobson: Enable desktop history page for all mobile logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990152 (https://phabricator.wikimedia.org/T353388) [17:18:01] !log tchanders@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [17:18:27] !log tchanders@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [17:34:47] (03CR) 
10Andrea Denisse: [C: 03+2] pontoon: Enroll pontoon-grafana-02 [puppet] - 10https://gerrit.wikimedia.org/r/989989 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [17:39:10] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) [17:50:10] 10SRE-OnFire, 10Observability-Alerting, 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) [17:53:07] 10SRE-OnFire, 10Observability-Alerting, 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) p:05High→03Medium Lowering priority due to lack of activity, we can revisit this if it continues to be a pressing matter. [18:03:27] PROBLEM - Check systemd state on aphlict1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:35] RECOVERY - Check systemd state on aphlict1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:57] (03CR) 10Brouberol: [C: 03+1] "Thanks for the detailed commit message!" 
[puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [18:07:10] !log aphlict1002 - systemctl start logrotate [18:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:28] (03PS12) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [18:40:45] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [18:41:39] (03PS13) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [18:41:57] (03PS1) 10Gergő Tisza: Log emails in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 [18:42:00] (03CR) 10CI reject: [V: 04-1] grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [18:42:15] (03PS1) 10Brouberol: spark-history: set production retention to 60 days [deployment-charts] - 10https://gerrit.wikimedia.org/r/990034 (https://phabricator.wikimedia.org/T354927) [18:42:48] (03PS2) 10Gergő Tisza: Log emails in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 [18:43:31] (03PS14) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [18:47:34] (03PS15) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [18:49:15] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1098/co" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [18:50:33] (03PS16) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [18:51:30] (03CR) 10Jforrester: [C: 03+1] Log emails in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990164 (owner: 10Gergő Tisza) [18:53:13] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [18:56:26] (03PS13) 10Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [18:59:20] (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron) [19:00:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:16] (03PS17) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [19:20:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:25:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:28:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:33:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:53:43] (03PS2) 10Htriedman: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [19:54:22] (03CR) 10Htriedman: "What else needs to happen on this patch to get it out the door?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (owner: 10Htriedman) [20:13:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:23:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:05:59] (03CR) 10Dzahn: [C: 03+2] "we now have the part where /srv/homes is automatically synced by pulling from active server on passive server." [puppet] - 10https://gerrit.wikimedia.org/r/988111 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [21:07:52] (03CR) 10Dzahn: [C: 03+1] "do we really need approval if there is no functional change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/989835 (https://phabricator.wikimedia.org/T350791) (owner: 10Jelto) [21:12:17] (03PS1) 10Dzahn: phabricator: auto-sync /srv/repos between servers [puppet] - 10https://gerrit.wikimedia.org/r/990247 (https://phabricator.wikimedia.org/T354221) [21:17:49] (03PS1) 10Urbanecm: [beta] Temporary accounts: Set expiry to 1 year [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990248 (https://phabricator.wikimedia.org/T344695) [21:19:48] (03CR) 10Urbanecm: [C: 03+2] "beta only, no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990248 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [21:20:48] (03Merged) 10jenkins-bot: [beta] Temporary accounts: Set expiry to 1 year [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990248 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [21:37:36] (03PS1) 10Andrew Bogott: openstack trove: specify an exec image for mysql backups [puppet] - 10https://gerrit.wikimedia.org/r/990249 (https://phabricator.wikimedia.org/T349651) [21:39:59] (03CR) 10Andrew Bogott: [C: 03+2] openstack trove: specify an exec image for mysql backups [puppet] - 10https://gerrit.wikimedia.org/r/990249 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [21:41:21] (03PS1) 10Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) [21:42:30] (03CR) 10CI reject: [V: 04-1] phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [21:49:41] (03PS2) 10Dzahn: phabricator: add script/timer to create tarballs of home dirs [puppet] - 10https://gerrit.wikimedia.org/r/990250 (https://phabricator.wikimedia.org/T354221) [21:49:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:08:49] (03PS1) 10Dzahn: admin: remove ssh key of Connie Chen [puppet] - 10https://gerrit.wikimedia.org/r/990254 (https://phabricator.wikimedia.org/T354961) [22:10:34] (03CR) 10Dzahn: [C: 03+2] admin: remove ssh key of Connie Chen [puppet] - 10https://gerrit.wikimedia.org/r/990254 (https://phabricator.wikimedia.org/T354961) (owner: 10Dzahn) [22:15:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:16:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:34] !log dzahn@cumin1001 START - Cookbook sre.idm.logout Logging Conniecc1 out of all services on: 2213 hosts [22:29:18] (03PS3) 10Cwhite: Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 [22:29:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Conniecc1 out of all services on: 2213 hosts [22:31:24] (03CR) 10Cwhite: [C: 03+2] Revert "Create initial stub role for logging-hd and configure for Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/989877 (owner: 10Cwhite) [22:33:05] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@21734.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:17] !log dzahn@cumin1001 START - Cookbook sre.idm.logout Logging Conniecc1 out of all services on: 2213 hosts [22:52:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Conniecc1 out of all services on: 2213 hosts [23:00:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:02:23] (03PS1) 10Dzahn: admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) [23:03:59] (03CR) 10CI reject: [V: 04-1] admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) (owner: 10Dzahn) [23:05:25] (03PS2) 10Dzahn: admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) [23:07:01] (03CR) 10CI reject: [V: 04-1] admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) (owner: 10Dzahn) [23:07:40] (03PS3) 10Dzahn: admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) [23:09:24] (03CR) 10Dzahn: [C: 03+2] admin: remove conniecc1 from groups, set to absent [puppet] - 10https://gerrit.wikimedia.org/r/990259 (https://phabricator.wikimedia.org/T354961) (owner: 10Dzahn) [23:47:55] !log dzahn@cumin1001 START - Cookbook sre.idm.logout Logging Conniecc1 out of all services on: 2213 hosts [23:49:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Conniecc1 out of all services on: 2213 hosts [23:53:55] (03CR) 10Cwhite: [C: 03+1] "Change LGTM. Let's roll it out early next week unless there are objections." [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [23:55:11] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar)