[00:01:06] (03CR) 10Scott French: "Thanks, Raine!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [00:09:10] (03CR) 10Scott French: "Thanks, Raine!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: 10Kamila Součková) [00:56:14] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:02:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 284378408 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:06:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7050408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:06:35] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:08:23] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:09:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:11:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1266500 [01:11:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1266500 (owner: 10TrainBranchBot) [01:18:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:19:48] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:24:09] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1266500 (owner: 10TrainBranchBot) [01:30:13] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:30:53] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:41:15] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [01:51:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:54:29] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:00:56] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:01:29] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:06:11] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [02:07:20] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s) [02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable