[00:03:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P34557 and previous config saved to /var/cache/conftool/dbconfig/20220913-000340-ladsgroup.json [00:04:55] (03PS1) 10Dzahn: phabricator: move scap user sudo defaults to file, fix puppet [puppet] - 10https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259) [00:05:20] (03CR) 10Dzahn: "as done in modules/profile/manifests/toolforge/base.pp" [puppet] - 10https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259) (owner: 10Dzahn) [00:08:04] (03CR) 10Dzahn: [C: 03+2] phabricator: move scap user sudo defaults to file, fix puppet [puppet] - 10https://gerrit.wikimedia.org/r/831637 (https://phabricator.wikimedia.org/T313259) (owner: 10Dzahn) [00:14:46] (03PS1) 10Dzahn: phabricator: use content, not source with a plain file [puppet] - 10https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259) [00:15:16] (03CR) 10Dzahn: [C: 03+2] phabricator: use content, not source with a plain file [puppet] - 10https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259) (owner: 10Dzahn) [00:15:56] (03CR) 10Dzahn: [V: 03+2 C: 03+2] phabricator: use content, not source with a plain file [puppet] - 10https://gerrit.wikimedia.org/r/831638 (https://phabricator.wikimedia.org/T313259) (owner: 10Dzahn) [00:18:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T314041)', diff saved to https://phabricator.wikimedia.org/P34558 and previous config saved to /var/cache/conftool/dbconfig/20220913-001846-ladsgroup.json [00:18:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [00:18:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [00:19:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [00:19:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34559 and previous config saved to /var/cache/conftool/dbconfig/20220913-001908-ladsgroup.json [00:21:06] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:30:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:00] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: syntax error in sudo [00:48:15] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: syntax error in sudo [00:49:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: syntax error in sudo [00:49:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: syntax error in sudo [00:49:44] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2001.codfw.wmnet with reason: syntax error in sudo [00:49:59] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2001.codfw.wmnet with reason: syntax error in sudo [00:51:09] (03PS1) 10Dzahn: phabricator: add double quotes around sudo config file [puppet] - 
10https://gerrit.wikimedia.org/r/831640 [00:52:42] (03PS2) 10Dzahn: phabricator: add double quotes around sudo config file [puppet] - 10https://gerrit.wikimedia.org/r/831640 [00:53:02] (03PS3) 10Dzahn: phabricator: add double quotes around sudo config line [puppet] - 10https://gerrit.wikimedia.org/r/831640 [00:59:46] (03CR) 10Dzahn: [C: 03+2] phabricator: add double quotes around sudo config line [puppet] - 10https://gerrit.wikimedia.org/r/831640 (owner: 10Dzahn) [01:06:32] (03PS1) 10Dzahn: phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - 10https://gerrit.wikimedia.org/r/831642 [01:08:32] (03PS2) 10Dzahn: phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - 10https://gerrit.wikimedia.org/r/831642 [01:08:38] (03CR) 10Dzahn: [C: 03+2] phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - 10https://gerrit.wikimedia.org/r/831642 (owner: 10Dzahn) [01:08:43] (03CR) 10Dzahn: [V: 03+2 C: 03+2] phabricator: absent /etc/sudoers.d/scap_sudo_defaults [puppet] - 10https://gerrit.wikimedia.org/r/831642 (owner: 10Dzahn) [01:24:04] 10SRE, 10Observability-Metrics: SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10lmata) [01:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34560 and previous config saved to /var/cache/conftool/dbconfig/20220913-013555-ladsgroup.json [01:35:59] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (9) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:46:45] (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P34561 and previous config saved to /var/cache/conftool/dbconfig/20220913-015102-ladsgroup.json [01:51:45] (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0200) [02:06:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P34562 and previous config saved to /var/cache/conftool/dbconfig/20220913-020608-ladsgroup.json [02:06:14] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job ganeti in 
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190) [02:07:28] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [02:07:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:07:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:08:40] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: OpenSent - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:09:21] (ProbeDown) firing: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[02:10:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:10:36] (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [02:10:40] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:11:12] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5003.eqsin.wmnet, cp5006.eqsin.wmnet, cp5014.eqsin.wmnet, cp5005.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5014.eqsin.wmnet, cp5006.eqsin.wmnet, cp5005.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:11:12] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5014.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet, cp5005.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5003.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet, cp5004.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:11:36] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 148 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:11:45] (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:19] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:12:24] hi, looking [02:12:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:13:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:13:14] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 346, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:13:24] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:13:24] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:13:29] just acked, looking as well rzl [02:13:54] PROBLEM - Check systemd state on ganeti5003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:14:00] looks like a transit blip in eqsin, I was just about to do a precautionary 
depool but maybe we're back? checking [02:14:21] (ProbeDown) resolved: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:15:10] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:15:23] HTTP requests received at eqsin went to zero for about a minute but look fully recovered [02:15:26] rzl: how did you pinpoint it to eqsin? [02:15:28] I am seeing "socket: permission denied" as the most common error message but it looks as if it's over [02:15:36] (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [02:16:09] jhathaway: just the alert text -- BGP status on cr3-eqsin, pybal alerts for cp5xxx, etc [02:16:21] good point! 
[02:16:50] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:23] (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:17:45] and that HTTP requests data point was from https://grafana.wikimedia.org/goto/joxhmCGVz?orgId=1 [02:17:46] (Primary outbound port utilisation over 80% #page) firing: (3) Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:18:16] I believe "firing: [...] 
recovered from" means it is a recovery [02:19:38] yea, the original "firing" is still shown as 6 minutes ago [02:20:16] best guess, we dropped about 950,000 requests for cache-text in eqsin [02:20:22] over the course of about a minute [02:20:32] no lossage for upload, curiously [02:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T314041)', diff saved to https://phabricator.wikimedia.org/P34563 and previous config saved to /var/cache/conftool/dbconfig/20220913-022114-ladsgroup.json [02:21:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [02:21:19] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:21:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [02:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34564 and previous config saved to /var/cache/conftool/dbconfig/20220913-022136-ladsgroup.json [02:21:48] I don't see anything still broken though, and I don't think there's any sense in depooling just in case it happens again -- maybe if there's a second blip we'd depool before there's a third, but not otherwise [02:22:16] jhathaway, mutante: anything you think we need to do here? 
if not I'll write a very tiny IR and call it a night [02:22:16] also see mail from fastnetmon, btw [02:22:33] No, I agree with you [02:22:36] rzl: nope, I think that makes sense [02:22:43] yeah, I assume we lost one transit route and so we saturated the other [02:22:46] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:23:58] clicking that link first showed "10 min ago" and then it disappeared [02:24:08] kind of not matching the IRC message [02:24:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:24:30] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.1 [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831645 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [02:27:24] the RIPE Atlas probes are green again when looking from Singapore itself [02:27:46] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page 
[02:30:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:31:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:31:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:32:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0300) [03:01:14] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190) [03:01:16] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [03:01:32] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:57] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831649 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [03:02:22] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.1 refs T314190 [03:02:25] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [03:06:12] RECOVERY - Check systemd 
state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:08:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:08:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:09:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:13:10] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:37:59] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1 refs T314190 (duration: 35m 37s) [03:38:02] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [03:39:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:40:01] !log mwpresync@deploy1002 Pruned MediaWiki: 1.39.0-wmf.27 (duration: 01m 59s) [03:46:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:46:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [03:50:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:50:20] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:52:40] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:53:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:58:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:01:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:01:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:08:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:12:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34565 and previous config saved to /var/cache/conftool/dbconfig/20220913-041251-ladsgroup.json [04:12:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:27:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P34566 and previous config saved to /var/cache/conftool/dbconfig/20220913-042758-ladsgroup.json [04:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P34567 and previous config saved to 
/var/cache/conftool/dbconfig/20220913-044304-ladsgroup.json [04:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T314041)', diff saved to https://phabricator.wikimedia.org/P34568 and previous config saved to /var/cache/conftool/dbconfig/20220913-045811-ladsgroup.json [04:58:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [04:58:15] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:58:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [04:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34569 and previous config saved to /var/cache/conftool/dbconfig/20220913-045832-ladsgroup.json [05:00:10] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 249 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:18:50] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 257 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:28:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:35:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 109 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:42:00] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 192 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:53:30] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:00:05] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0600). 
[06:00:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:00:52] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:07:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:09:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34570 and previous config saved to /var/cache/conftool/dbconfig/20220913-060938-ladsgroup.json [06:09:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:17:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:21:38] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P34571 and previous config saved to /var/cache/conftool/dbconfig/20220913-062444-ladsgroup.json [06:26:16] PROBLEM - 
MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:33:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 285 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:38:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:38:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:38:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:39:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:39:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34572 and previous config saved to /var/cache/conftool/dbconfig/20220913-063908-ladsgroup.json [06:39:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:39:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P34573 and previous config saved to /var/cache/conftool/dbconfig/20220913-063951-ladsgroup.json [06:42:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 192 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:49:15] (03PS1) 10Muehlenhoff: Extend Tumult labs access by a week, current contract extension still WIP [puppet] - 10https://gerrit.wikimedia.org/r/831766 [06:53:16] (03CR) 10Muehlenhoff: [C: 03+2] Extend Tumult labs access by a week, current contract extension still WIP [puppet] - 10https://gerrit.wikimedia.org/r/831766 (owner: 10Muehlenhoff) [06:54:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 265 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:54:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34574 and previous config saved to /var/cache/conftool/dbconfig/20220913-065457-ladsgroup.json [06:54:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:55:02] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:55:08] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [06:55:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:00:04] Amir1 and Urbanecm: That opportune time is upon us again. 
Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:03:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 289 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:06:16] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:12] (03CR) 10Muehlenhoff: [C: 03+2] druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:08:20] (03PS3) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) [07:11:40] !log jhuneidi@deploy1002 deploy-promote aborted: (duration: 00m 09s) [07:13:14] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190) [07:13:16] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [07:13:18] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:00] we are rolling back from testwikis due to the high rate of fatals since the sync [07:14:03] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831769 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) 
[07:14:18] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.28 refs T314190 [07:14:21] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [07:14:28] (03CR) 10Muehlenhoff: [C: 03+2] memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:16:46] log installing zlib security updates on buster [07:17:38] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:18:47] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.28 refs T314190 (duration: 04m 29s) [07:19:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:24:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:24:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:24:38] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:24:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:31:40] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:36:20] (03CR) 10JMeybohm: [C: 03+2] sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [07:40:06] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: 10JMeybohm) [07:43:27] (03PS1) 10Cathal Mooney: Depool codfw prior to core router upgrades. [dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690) [07:43:43] (03CR) 10Hashar: systemd: allow changing override filename (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [07:43:58] (03PS2) 10Hashar: systemd: allow changing override filename [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) [07:46:56] (03PS5) 10Hashar: jenkins: use upstream systemd definition [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) [07:48:21] (03CR) 10Hashar: "I have rebased since the parent change had some tweaks." [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [07:51:46] (03CR) 10Jcrespo: [C: 03+1] Depool codfw prior to core router upgrades. 
[dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [08:03:46] (03CR) 10Vgutierrez: [C: 03+2] mtail::varnishsli: Consider req.body read|write errors as good requests [puppet] - 10https://gerrit.wikimedia.org/r/831528 (https://phabricator.wikimedia.org/T317051) (owner: 10Vgutierrez) [08:06:09] 10SRE, 10Traffic, 10Patch-For-Review: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (10Vgutierrez) 05Open→03Stalled I'm waiting for a while after merging https://gerrit.wikimedia.org/r/831528, next steps aren't feasible in the short term [08:09:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 129 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:10:00] the train got rolled back by jeena at ~ 7:15 UTC [08:10:07] so we are solely running 1.39.0-wmf.28 [08:11:39] and there are a bunch of errrs from MySQLPrimaryPos such as `PHP Notice: Undefined index: position` | ` InvalidArgumentException: GTID set cannot be empty.` [08:13:43] (03CR) 10Cathal Mooney: [C: 03+2] Depool codfw prior to core router upgrades. [dns] - 10https://gerrit.wikimedia.org/r/831800 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [08:15:27] !log de-pooling codfw ahead of core router upgrades at the site [08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:22] hashar: are the errors ongoing? or were past? 
[08:17:47] !log roll-restarting apache/FPM on mw canaries to pick up zlib security updates [08:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:12] jynus: it is ongoing, apparently due to a serialization issue between .28 and the new 1.40.0-wmf.1 [08:18:16] I am digging ;-] [08:19:33] what I see is "Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending" on the job queue [08:23:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 168 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:27:47] !log cmooney@cumin1001 START - Cookbook sre.network.cf [08:27:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:27:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet [08:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:28:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 258 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:28:50] (03CR) 10Btullis: [C: 03+2] Add the locations of the new hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/831532 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [08:31:35] hmm I am not qualified for that serialization issue, it is marked as a blocker ( https://phabricator.wikimedia.org/T317606 ) [08:32:48] RECOVERY - MediaWiki 
exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:32:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [08:34:17] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [08:35:05] jbond: I think the first can be merged right now, the second affects Jenkins and requires some manual steps for deployment but I am gathering evidence for the mw train blocker ;) [08:36:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet [08:37:30] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10Tanuja_Doriya) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks... 
[08:37:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:39:56] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet [08:40:26] Amir1: duesen: I am around for the train blocker if need be, but I can't say I understand what is going on :-\ [08:40:34] (03PS1) 10MVernon: swift: remove ms-be20[28-39] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549) [08:41:01] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10WMDE-leszek) [08:42:12] and there are bunches of `PHP Notice: apcu_fetch(): Error at offset 42 of 856 bytes` [08:43:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34575 and previous config saved to /var/cache/conftool/dbconfig/20220913-084307-ladsgroup.json [08:43:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:43:14] hashar: the error doesn't happen on the new version, it happens in the old. It's just a incompatible of old/new serialization [08:43:45] what I noticed is that the chronologyprotector cache key version got bumped [08:44:01] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10WMDE-leszek) Tagging wmf-legal was a mistake, apologies. 
[08:44:02] so I kind of expect the caches to be namespaced by that [08:44:11] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap, wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Tanuja_Doriya) [08:46:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet [08:46:59] !log Disabled LVS/PyBal peerings on cr1-codfw ain advance of upgrade to router. [08:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:18] (03CR) 10Vgutierrez: "please let's move forward with this. It is taking too much time|energy" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [08:53:59] (03PS1) 10Marostegui: mariadb: Promote db2112 to s1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614) [08:54:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T317614 [08:54:05] T317614: Switchover codfw s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T317614 [08:54:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T317614 [08:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2112 with weight 0 T317614', diff saved to https://phabricator.wikimedia.org/P34576 and previous config saved to /var/cache/conftool/dbconfig/20220913-085456-marostegui.json [08:56:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) [08:56:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10JMeybohm) [08:56:40] !log Flipping primary routing engine to RE1 on cr1-codfw (disruptive) as part of 
upgrade. [08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) 05Open→03Resolved Merged as `sre.k8s.pool-depool-cluster` [08:57:02] (03CR) 10Ladsgroup: [C: 03+1] "I think switchmaster can do this 😄" [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614) (owner: 10Marostegui) [08:57:11] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2112 to s1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831814 (https://phabricator.wikimedia.org/T317614) (owner: 10Marostegui) [08:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P34577 and previous config saved to /var/cache/conftool/dbconfig/20220913-085814-ladsgroup.json [08:59:21] PROBLEM - Host cr1-codfw #page is DOWN: PING CRITICAL - Packet loss = 100% [08:59:38] I think that is expected? [08:59:42] I think so [08:59:48] ok [09:00:02] but we should make sure it is 0 impact [09:00:02] * Emperor arrives from the p.age [09:00:19] gonna ack it [09:01:02] this is the work topranks is doing, presumably? [09:01:12] (03CR) 10Clément Goubert: [C: 03+1] "I went through the linked tasks from 2db4b19f660 and couldn't find the reason why we need to extend ephemeral port range. Do we actually h" [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: 10Dzahn) [09:01:20] em yes... [09:01:33] Emporer: yes... 
apologies thought I'd downtimed the host [09:01:52] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:01:54] PROBLEM - Host cr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:28] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt with reason: router upgrade [09:02:46] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Page on etcdmirror critical status - https://phabricator.wikimedia.org/T317402 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert [09:02:50] I messed up syntax it seems and it didn't do what I thought. [09:02:54] apologies for noise [09:02:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt with reason: router upgrade [09:02:55] RECOVERY - Host cr1-codfw #page is UP: PING OK - Packet loss = 0%, RTA = 45.62 ms [09:03:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=927fadc1-f5b2-478f-95ce-98bfc47881a9) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th... [09:03:07] I can access codfw and edit as normal [09:03:19] (03CR) 10Hashar: [C: 03+1] scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:04:16] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:10] topranks: for clarification, I know codfw was depooled, but the page itself was not expected within your maintenance? 
[09:06:12] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:02] jynus: Page shouldn't have fired - host should have been downtimed but I'd messed up the command and that didn't happen [09:07:08] It's properly downtimed now. [09:07:20] ok, that is not important [09:07:27] but the maintenance was done as expected, right? [09:07:54] maintenance is ongoing, but all going to plan, everything right now routing via cr2-codfw so no services should be affected [09:08:03] thanks for clarification [09:08:13] just a monitoring issues then [09:08:18] RECOVERY - Host cr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms [09:08:30] *downtime issue [09:08:50] yeah exactly [09:11:06] !log volans@cumin1001 START - Cookbook sre.network.cf [09:11:06] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [09:13:03] 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) [09:13:14] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:18] 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) p:05Triage→03Medium [09:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P34578 and previous config saved to /var/cache/conftool/dbconfig/20220913-091320-ladsgroup.json [09:14:18] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap, wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Aklapper) [09:14:20] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317612 (10Aklapper) 
[09:15:06] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Aklapper) [09:19:42] !log Starting s1 codfw failover from db2103 to db2112 - T317614 [09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:45] T317614: Switchover codfw s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T317614 [09:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 primary T317614', diff saved to https://phabricator.wikimedia.org/P34579 and previous config saved to /var/cache/conftool/dbconfig/20220913-092032-root.json [09:21:42] 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) [09:22:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103 T317614', diff saved to https://phabricator.wikimedia.org/P34580 and previous config saved to /var/cache/conftool/dbconfig/20220913-092200-root.json [09:23:41] (03PS1) 10Marostegui: db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831824 [09:24:28] (03CR) 10Marostegui: [C: 03+2] db2103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831824 (owner: 10Marostegui) [09:24:30] (03CR) 10Jbond: [C: 03+2] systemd: allow changing override filename [puppet] - 10https://gerrit.wikimedia.org/r/831534 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [09:24:34] (03CR) 10Jbond: [C: 03+2] jenkins: use upstream systemd definition [puppet] - 10https://gerrit.wikimedia.org/r/808900 (https://phabricator.wikimedia.org/T308637) (owner: 10Hashar) [09:25:01] marostegui: happy fopr me to merge yours [09:25:33] !log Stopped Puppet on contint2001 for a Jenkins systemd change [09:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:49] (Emergency syslog message) firing: Alert for device cr1-codfw.wikimedia.org - Emergency syslog message - 
https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:26:52] jbond: there are multiple puppet changes pending from you [09:27:02] jbond: go ahead [09:27:11] ack merging (cc hashar ) [09:27:27] ready to run puppet and validate on releases1002 [09:27:37] will continue over private chat [09:28:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T314041)', diff saved to https://phabricator.wikimedia.org/P34581 and previous config saved to /var/cache/conftool/dbconfig/20220913-092826-ladsgroup.json [09:28:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [09:28:30] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:28:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [09:28:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:28:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [09:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34582 and previous config saved to /var/cache/conftool/dbconfig/20220913-092904-ladsgroup.json [09:33:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [09:33:38] !log Enabling Puppet on contint2001 for Jenkins systemd change [09:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:48] (Emergency syslog message) resolved: Device cr1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:37:04] !log Restarting CI Jenkins on contint2001 
(with new systemd service) [09:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:49] (Emergency syslog message) firing: Alert for device cr1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:41:05] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [09:41:58] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10HasanAkgun_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public task... [09:42:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet [09:43:28] (03PS1) 10Marostegui: Revert "db2103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831564 [09:45:02] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [09:45:42] (03CR) 10Sergio Gimeno: [C: 04-1] "Do not merge until T305406 is resolved" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: 10Sergio Gimeno) [09:46:08] (03CR) 10Marostegui: [C: 03+2] Revert "db2103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/831564 (owner: 10Marostegui) [09:46:49] (Device rebooted) firing: Alert for device cr1-codfw.wikimedia.org - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [09:50:50] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10Aklapper) Hi, please use the template at https://phabricator.wikimedia.org/project/profile/1564/ and update any potential onboarding docs, if applicable. Thanks. 
[09:50:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging2002.codfw.wmnet with reason: Kafka PKI upgrade [09:51:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging2002.codfw.wmnet with reason: Kafka PKI upgrade [09:51:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34583 and previous config saved to /var/cache/conftool/dbconfig/20220913-095137-root.json [09:51:45] (03CR) 10Elukey: [C: 03+2] Move kafka on kafka-logging2002 to a PKI-based TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/831588 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [09:51:49] (Device rebooted) resolved: Device cr1-codfw.wikimedia.org recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [09:52:46] !log move kafka-logging2002 to PKI-based TLS certs [09:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:22] (03CR) 10Jbond: [C: 03+1] "LGTM, also adding kieth for an additional sanity check" [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [10:00:54] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) [10:02:52] (03PS1) 10Elukey: role::kafka::logging: move kafka on all codfw nodes to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) [10:04:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. 
[10:04:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37242/console" [puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [10:05:00] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:06:18] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:21] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) > @Milimetric Can you verify that this really is what you want? Going forward, it sounds like perhaps amending the documentation to be a little... [10:06:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34584 and previous config saved to /var/cache/conftool/dbconfig/20220913-100642-root.json [10:10:26] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:00] 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) @MatthewVernon we would like to get your input here. Before tuning Swift's current TLS termination we'd like to know what are your plans regarding it. Is a migration to envoy in... 
[10:14:26] (03PS1) 10Muehlenhoff: ml-etcd: Also include staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/831832 [10:16:03] !log Flipping master RE on cr1-codfw to backup as part of upgrade [10:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:45] (03PS1) 10FNegri: Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) [10:20:38] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:50] (03CR) 10Elukey: "We don't have ml-etcd staging nodes, is it the right alias?" 
[puppet] - 10https://gerrit.wikimedia.org/r/831832 (owner: 10Muehlenhoff) [10:21:20] (03CR) 10Elukey: [C: 03+1] "of course we have, sorry, forgot about them :D" [puppet] - 10https://gerrit.wikimedia.org/r/831832 (owner: 10Muehlenhoff) [10:21:36] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34585 and previous config saved to /var/cache/conftool/dbconfig/20220913-102147-root.json [10:22:06] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:22:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:22:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:22:32] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:22:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34586 and previous config saved to /var/cache/conftool/dbconfig/20220913-102232-ladsgroup.json [10:22:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:22:52] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:24:22] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) 05Resolved→03Open sudo cookbook -d sre.dns.netbox This command is requiring me to enter password and not working [10:24:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:00] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:54] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:26:14] (03PS1) 10Krinkle: rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606) [10:26:26] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:36] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:26:50] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:26:56] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:27:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:59] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) @Jclark-ctr you need to use the `secure-cookbook` binary instead of the `cookbook` one. See also the related patch above for how thats configu... [10:31:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T317627 [10:35:39] T317627: Switchover s2 codfw master (db2104 -> db2107) - https://phabricator.wikimedia.org/T317627 [10:35:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T317627 [10:35:56] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. 
[10:36:09] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:36:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:36:19] (03PS1) 10Marostegui: mariadb: Promote db2107 to s2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831837 (https://phabricator.wikimedia.org/T317627) [10:36:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2107 with weight 0 T317627', diff saved to https://phabricator.wikimedia.org/P34587 and previous config saved to /var/cache/conftool/dbconfig/20220913-103621-marostegui.json [10:36:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2107 from api T317627', diff saved to https://phabricator.wikimedia.org/P34588 and previous config saved to /var/cache/conftool/dbconfig/20220913-103658-marostegui.json [10:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34589 and previous config saved to /var/cache/conftool/dbconfig/20220913-103705-root.json [10:38:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2107 to s2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/831837 (https://phabricator.wikimedia.org/T317627) (owner: 10Marostegui) [10:39:08] (03CR) 10David Caro: [C: 03+1] "LGTM, some nits" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: 10FNegri) [10:43:46] 
(Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:07] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.6.0 [software/homer] - 10https://gerrit.wikimedia.org/r/831838 [10:46:14] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606) (owner: 10Krinkle) [10:48:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:50:10] (03CR) 10FNegri: [C: 04-1] "I tried running 'cookbook wmcs.ceph.osd.bootstrap_and_add --new-osd-fqdn cloudcephosd1030.eqiad.wmnet --only-check' and the jumbo ping che" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: 10David Caro) [10:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34590 and previous config saved to /var/cache/conftool/dbconfig/20220913-105210-root.json [10:55:59] !log Starting s2 codfw failover from db2104 to db2107 - T317627 [10:56:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:02] T317627: Switchover s2 codfw master (db2104 -> db2107) - https://phabricator.wikimedia.org/T317627 [10:56:20] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.6.0 [software/homer] - 10https://gerrit.wikimedia.org/r/831838 (owner: 10Volans) [10:56:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [10:56:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2107 to s2 codfw primary T317627', diff saved to https://phabricator.wikimedia.org/P34591 and previous config saved to /var/cache/conftool/dbconfig/20220913-105642-marostegui.json [10:57:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T317627', diff saved to https://phabricator.wikimedia.org/P34592 and previous config saved to /var/cache/conftool/dbconfig/20220913-105733-root.json [10:59:46] (03PS1) 10Cathal Mooney: Disable VRRP auth between CRs in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) [11:01:54] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.0 [software/homer] - 10https://gerrit.wikimedia.org/r/831838 (owner: 10Volans) [11:01:55] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10jcrespo) Ganeti exporter has been unavailable since 20:17:30: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1 W... 
[11:02:57] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 12 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [11:03:06] (03Merged) 10jenkins-bot: rdbms: Bump ChronologyProtector cache key version [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831847 (https://phabricator.wikimedia.org/T317606) (owner: 10Krinkle) [11:03:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [11:04:08] (03PS1) 10Btullis: Put the new hadoop nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/831841 (https://phabricator.wikimedia.org/T311210) [11:06:48] (03CR) 10Cathal Mooney: [C: 03+2] Disable VRRP auth between CRs in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [11:06:52] (03PS2) 10Jgreen: DMARC External Domain Verification for wikipedia.org and w.wiki. 
[dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) [11:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:07:15] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Disable VRRP auth between CRs in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [11:07:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34593 and previous config saved to /var/cache/conftool/dbconfig/20220913-110715-root.json [11:07:28] (03Merged) 10jenkins-bot: Disable VRRP auth between CRs in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/831840 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [11:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:07:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:07:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:07:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34594 and previous config saved to /var/cache/conftool/dbconfig/20220913-110755-ladsgroup.json [11:07:58] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [11:08:02] (03CR) 10Jgreen: [C: 03+2] DMARC External Domain Verification for wikipedia.org and w.wiki. 
[dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: 10Jgreen) [11:08:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:08:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:08:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34595 and previous config saved to /var/cache/conftool/dbconfig/20220913-110850-root.json [11:09:21] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:09:32] !log ladsgroup@deploy1002 Synchronized php-1.40.0-wmf.1/includes/libs/rdbms/ChronologyProtector.php: Backport: [[gerrit:831847|rdbms: Bump ChronologyProtector cache key version (T317606)]] (duration: 03m 49s) [11:09:35] T317606: PHP Notice: Undefined index: asOfTime - https://phabricator.wikimedia.org/T317606 [11:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:12:27] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [11:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:14:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:33] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt [11:14:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt [11:15:02] !log completed cr1-codfw upgrade, will proceed to cr2-codfw shortly [11:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:20] (03PS1) 10Jgreen: Fix DMARC external domain verification records. [dns] - 10https://gerrit.wikimedia.org/r/831843 (https://phabricator.wikimedia.org/T211401) [11:17:13] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (GET namespaces) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:24] (03CR) 10Jgreen: [C: 03+2] Fix DMARC external domain verification records. 
[dns] - 10https://gerrit.wikimedia.org/r/831843 (https://phabricator.wikimedia.org/T211401) (owner: 10Jgreen) [11:19:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:57] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: router upgrade [11:21:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: router upgrade [11:21:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34596 and previous config saved to /var/cache/conftool/dbconfig/20220913-112112-ladsgroup.json [11:21:16] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:23:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [11:23:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34597 and previous config saved to /var/cache/conftool/dbconfig/20220913-112355-root.json [11:24:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [11:24:33] (03CR) 10MVernon: [C: 03+2] swift: remove ms-be20[28-39] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/831812 (https://phabricator.wikimedia.org/T294549) (owner: 10MVernon) [11:27:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:27:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:27:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:28:09] PROBLEM - Hadoop HDFS Namenode FSImage Age on an-master1002 is CRITICAL: FILE_AGE CRITICAL: /srv/hadoop/name/current/VERSION is 7281 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:28:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34598 and previous config saved to /var/cache/conftool/dbconfig/20220913-112818-ladsgroup.json [11:28:22] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:34:23] !log Upgrading CI Jenkins T317418 [11:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:27] T317418: Upgrade Jenkins to latest LTS 2.361.1 - https://phabricator.wikimedia.org/T317418 [11:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P34599 and previous config saved to 
/var/cache/conftool/dbconfig/20220913-113619-ladsgroup.json [11:37:36] 10SRE, 10Infrastructure-Foundations, 10Mail: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (10Jgreen) 05Open→03Resolved a:03Jgreen ;; ANSWER SECTION: w.wiki._report._dmarc.wikimedia.org. 3600 IN TXT "v=DMARC1;" ;; ANSWER SECTION: wiki... [11:39:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34600 and previous config saved to /var/cache/conftool/dbconfig/20220913-113900-root.json [11:39:06] 10SRE, 10Data-Persistence, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) Thanks for asking! No, we don't currently have a move to envoy on our roadmap (I'm afraid there is too much higher-priority stuff there right now), though I'm not opposed to... [11:51:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P34601 and previous config saved to /var/cache/conftool/dbconfig/20220913-115125-ladsgroup.json [11:54:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34602 and previous config saved to /var/cache/conftool/dbconfig/20220913-115405-root.json [11:54:38] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [11:57:09] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) One of the hosts actually do report having "None" oper_vcpus, rather than 0. ` instances[30] {'disk_usage': 51328,... 
[11:57:57] !Disabling transit and ixp BGP on cr2-codfw in advance of software upgrade [11:58:04] !log Disabling transit and ixp BGP on cr2-codfw in advance of software upgrade [11:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:22] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10MoritzMuehlenhoff) >>! In T311288#8232068, @SLyngshede-WMF wrote: > One of the hosts actually do report having "None" oper_vcpus, rat... [12:02:50] RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 76 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:03:35] (03PS1) 10Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868 [12:03:37] (03PS1) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [12:03:44] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:04:41] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [12:05:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831868 (owner: 10Jbond) [12:06:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T314041)', diff saved to https://phabricator.wikimedia.org/P34603 and previous config saved to /var/cache/conftool/dbconfig/20220913-120632-ladsgroup.json [12:06:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:06:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:06:46] (03PS2) 
10Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868 [12:06:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34604 and previous config saved to /var/cache/conftool/dbconfig/20220913-120653-ladsgroup.json [12:06:55] (03PS2) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [12:08:04] (03CR) 10Volans: [C: 04-1] "wrong config file?" [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [12:08:24] (03PS3) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [12:09:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34605 and previous config saved to /var/cache/conftool/dbconfig/20220913-120910-root.json [12:09:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37245/console" [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [12:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34606 and previous config saved to /var/cache/conftool/dbconfig/20220913-121204-ladsgroup.json [12:12:08] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:12:46] (Processor usage over 85%) firing: Alert for device cr2-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [12:14:02] (03CR) 10Jbond: [V: 03+1] P:spicerack: add firmware directory (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [12:14:36] (03CR) 10Btullis: [C: 03+2] Put the new hadoop nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/831841 (https://phabricator.wikimedia.org/T311210) (owner: 10Btullis) [12:16:38] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) The problem is this host: dispatch-be1001.eqiad.wmnet which is configured to be down. It does in fact have no vCPUs a... [12:17:46] (Processor usage over 85%) resolved: Device cr2-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [12:21:46] (Emergency syslog message) firing: Alert for device cr2-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:24:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34607 and previous config saved to /var/cache/conftool/dbconfig/20220913-122415-root.json [12:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P34608 and previous config saved to /var/cache/conftool/dbconfig/20220913-122710-ladsgroup.json [12:31:46] (Emergency syslog message) resolved: Device cr2-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:31:58] (KubernetesAPILatency) 
resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:22] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10MoritzMuehlenhoff) >>! In T311288#8232088, @SLyngshede-WMF wrote: > The problem is this host: dispatch-be1001.eqiad.wmnet which is co... [12:34:17] (03PS1) 10Jgreen: Add fundraising host frdm1001.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/831870 (https://phabricator.wikimedia.org/T317443) [12:36:00] (03PS1) 10Slyngshede: Downed VMs will report None as vCPU allocation. [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) [12:36:26] (03CR) 10Jgreen: [C: 03+2] Add fundraising host frdm1001.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/831870 (https://phabricator.wikimedia.org/T317443) (owner: 10Jgreen) [12:38:50] (03PS2) 10Slyngshede: Downed VMs will report None as vCPU allocation. [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) [12:41:10] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) The patch use the oper_state of the instances, rather than just assuming that None should be 0.... 
[12:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P34609 and previous config saved to /var/cache/conftool/dbconfig/20220913-124217-ladsgroup.json [12:46:31] !log forcing non-graceful RE switchover on cr2-codfw as part of upgrade [12:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34610 and previous config saved to /var/cache/conftool/dbconfig/20220913-124758-ladsgroup.json [12:48:01] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:48:21] (03CR) 10Jcrespo: "LGTM, but note I don't have the context of this." [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [12:52:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:38] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:52:46] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:53:52] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:00] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:36] PROBLEM - Router 
interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:55:00] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:55:08] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:56:12] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:56:22] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:56:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T314041)', diff saved to https://phabricator.wikimedia.org/P34611 and previous config saved to /var/cache/conftool/dbconfig/20220913-125723-ladsgroup.json [12:57:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:57:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:57:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:32] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:57:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T314041)', diff saved to https://phabricator.wikimedia.org/P34612 and previous config saved to /var/cache/conftool/dbconfig/20220913-125745-ladsgroup.json [12:59:53] !log Switching active RE back to RE1 on cr1-codfw as firmware hadn't been loaded while it was master [12:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1300) [13:00:05] phuedx and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1300) [13:00:11] o/ [13:00:17] o/ [13:00:25] PROBLEM - Check systemd state on an-worker1144 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:02:11] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:02:27] RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P34613 and previous config saved to /var/cache/conftool/dbconfig/20220913-130304-ladsgroup.json [13:03:07] o/ [13:03:28] I can deploy! 
[13:03:35] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:13] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:08] (03PS1) 10KartikMistry: Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300) [13:05:47] (03PS3) 10Lucas Werkmeister (WMDE): Remove $wgWMESearchRelevancePages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824685 (owner: 10Phuedx) [13:05:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wgWMESearchRelevancePages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824685 (owner: 10Phuedx) [13:06:44] (03Merged) 10jenkins-bot: Remove $wgWMESearchRelevancePages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824685 (owner: 10Phuedx) [13:07:14] ok, grep confirms wmf-config/InitialiseSettings.php is the only remaining file with a reference to SearchRelevancePages [13:07:43] phuedx: the first change is on mwdebug1001, do you quickly want to check that nothing’s broken? 
[13:07:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:07:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:07:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance [13:07:52] (otherwise I’m also okay with syncing it directly, looks safe enough) [13:07:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance [13:08:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:08:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:08:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance [13:08:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance [13:08:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet with reason: Maintenance [13:08:53] Lucas_WMDE: A spot check of a couple of wikis on mwdebug1001 and I see no obvious breakages. 
As you say, the variable isn't used anywhee [13:08:54] *re [13:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:08:59] ok! [13:10:11] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:11:18] (03PS4) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [13:11:29] reviewing the second change [13:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34614 and previous config saved to /var/cache/conftool/dbconfig/20220913-131148-ladsgroup.json [13:11:49] apparently conf-labs-en_rtlwiki.json gets "rate": 0, not "rate": 1, in the diffConfig output [13:11:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:12:19] I guess that’s not in the wikipedia dblist [13:12:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37246/console" [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [13:12:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:13:01] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:13:14] (03PS3) 10Lucas Werkmeister (WMDE): testwiki: Add mediawiki.edit_attempt stream 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: 10Phuedx) [13:13:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824685|Remove $wgWMESearchRelevancePages]] (unused) (duration: 03m 53s) [13:13:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:14:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "diffConfig looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: 10Phuedx) [13:14:13] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:14:28] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:14:46] !log Flipping back to RE0 on cr2-codfw (last disruptive switch) [13:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] (03Merged) 10jenkins-bot: testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) (owner: 10Phuedx) [13:15:51] phuedx: the edit_attempt 
change is on mwdebug1001, can you test it? [13:15:55] On it [13:17:49] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:49] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:49] Lucas_WMDE: LGTM. I see the stream definition on testwiki but not on enwiki or dewiki for example [13:17:57] ok \o/ [13:18:03] thanks [13:18:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P34615 and previous config saved to /var/cache/conftool/dbconfig/20220913-131811-ladsgroup.json [13:18:27] syncing [13:19:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 138, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:18] !log set thanos ring replicas to 3.85 T311690 [13:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:21] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [13:20:19] (03PS1) 10Elukey: admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915) [13:20:27] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:20:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [13:20:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:21:25] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:21:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:826234|testwiki: Add mediawiki.edit_attempt stream (T309013)]] (1/2) (duration: 03m 39s) [13:22:06] T309013: EditAttemptStep Migration to MP - https://phabricator.wikimedia.org/T309013 [13:22:19] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:22:23] Lucas_WMDE: Thanks <3 [13:22:27] np :) [13:22:30] (still syncing IS-labs ^^) [13:23:02] koi: don’t you think it’s too early for that tnwiki change? 
it doesn’t look like anyone else voted for it (or reacted at all) at https://tn.wikipedia.org/wiki/Wikipedia:Patlelo_ya_set%C5%A1haba#Enabling_Extended_Confirmed_User_Group [13:23:18] I can see in the recent changes that Rebel Agent is the most active editor there, but they’re not the only one either [13:24:05] hmm, more than one week passed, and the author of that task is a (temporary) sysop at tnwiki [13:24:13] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:24:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:25:12] let me see if there are any other recent-ish tnwiki config changes and what kind of community approval they had [13:25:19] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:25:49] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:826234|testwiki: Add mediawiki.edit_attempt stream (T309013)]] (2/2) (duration: 03m 33s) [13:26:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P34616 and previous config saved to 
/var/cache/conftool/dbconfig/20220913-132654-ladsgroup.json [13:27:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:53] hm, not finding much in the way of tnwiki config changes [13:28:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:29:35] koi: I don’t know if other deployers would handle this differently, but to me there’s not enough community consensus to deploy that, sorry [13:29:52] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10HasanAkgun_WMDE) [13:30:21] Lucas_WMDE: fair enough, I'll wait for another couple of days [13:31:05] I would feel better if the community page had an update like “if no one objects until X then this will be deployed” [13:31:08] but idk if that’s usual or not [13:31:21] if another deployer wants to go ahead with that config change, I don’t mind either [13:33:17] !log UTC afternoon backport+config window done [13:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T314041)', diff saved to https://phabricator.wikimedia.org/P34617 and previous config saved to /var/cache/conftool/dbconfig/20220913-133317-ladsgroup.json [13:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:33:22] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:33:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:33:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T314041)', diff saved to 
https://phabricator.wikimedia.org/P34618 and previous config saved to /var/cache/conftool/dbconfig/20220913-133339-ladsgroup.json [13:36:01] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10karapayneWMDE) I am the Engineering manager for wikidata and I approve this request and confirm Hasan's affiliation with WDME. [13:41:29] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P34619 and previous config saved to /var/cache/conftool/dbconfig/20220913-134201-ladsgroup.json [13:51:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:28] ^ restarting bird on doh/durum, so expected. 
should clear up themselves [13:56:47] if not, then it's a problem and we will see :) [13:57:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T314041)', diff saved to https://phabricator.wikimedia.org/P34620 and previous config saved to /var/cache/conftool/dbconfig/20220913-135707-ladsgroup.json [13:57:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:57:12] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:57:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:57:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34621 and previous config saved to /var/cache/conftool/dbconfig/20220913-135729-ladsgroup.json [13:59:24] (03PS22) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:00:18] (03PS23) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:01:35] (03CR) 10Volans: [C: 04-1] "one typo, lgtm otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [14:02:15] (03PS24) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:02:17] (03CR) 10Hnowlan: [C: 03+2] Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: 10Vlad.shapik) [14:02:19] (03CR) 10Jbond: C:varnish: Rate limit hotlinking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [14:03:16] (03CR) 10Jbond: O:puppetmaster::standalone: move to useing P:puppetmaster::common (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) 
[14:03:19] (03CR) 10Jbond: [C: 03+2] O:puppetmaster::standalone: move to useing P:puppetmaster::common [puppet] - 10https://gerrit.wikimedia.org/r/831507 (owner: 10Jbond) [14:07:41] !log re-activating Transit on IX BGP on cr2-codfw [14:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:29] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt [14:12:47] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt [14:13:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:16:06] (03Merged) 10jenkins-bot: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) (owner: 10Vlad.shapik) [14:17:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:28] !log Core router upgrade in codfw complete - maintenance closed. 
[14:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:20:05] (03PS1) 10Cathal Mooney: Re-pool codfw after upgrading core routers on site [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690) [14:26:06] (03CR) 10MacFan4000: [C: 03+1] ExtensionDistributor: Add REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925) (owner: 10Jforrester) [14:27:07] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [14:28:19] (03CR) 10Klausman: [C: 03+1] admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [14:29:22] PROBLEM - MariaDB read only db_inventory #page on db2093 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.26-MariaDB-log, Uptime 6574s, event_scheduler: True, 285.80 QPS, connection latency: 0.004492s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:29:36] * volans here [14:29:37] is that maintenance?= [14:29:41] (03CR) 10Cathal Mooney: [C: 03+2] Re-pool codfw after upgrading core routers on site [dns] - 10https://gerrit.wikimedia.org/r/831889 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [14:29:43] downtime expired maybe? 
[14:29:48] rebooted [14:29:48] should cause no impact [14:29:50] grrr [14:29:50] * Emperor here [14:29:52] uptime 1:52 [14:29:52] yeah [14:29:55] fixing it [14:30:00] TY [14:30:04] did it crash? [14:30:09] nop [14:31:15] !re-pooling codfw on authdns after router upgrades completed. [14:32:02] (03CR) 10Ssingh: [C: 03+1] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:33:07] I am actually going to disable notifications for that host, it shouldn't have them enabled anymore [14:33:16] It is no longer active orchestrator db master [14:33:22] So it shouldn't create noise like this [14:33:47] +1 [14:33:50] RECOVERY - MariaDB read only db_inventory #page on db2093 is OK: Version 10.4.26-MariaDB-log, Uptime 6842s, read_only: True, event_scheduler: True, 70.98 QPS, connection latency: 0.004412s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:35:04] marostegui: I saw some host that were paging but probably shouldn't on misc, this was one of them [14:35:17] but there were others (pasive misc hosts) [14:35:48] jynus: db2078 perhaps? [14:35:50] it is ok if they alerted without paging [14:35:57] let me see [14:36:13] (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 [14:36:33] that host doesn't exist anymore, right? [14:36:51] (03CR) 10CI reject: [V: 04-1] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 (owner: 10Marostegui) [14:36:59] yeah, I am asking in case you saw that one some time ago [14:37:59] MariaDB read only m1 [14:38:28] (03PS2) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 [14:38:49] jynus: not sure what that check is and what it comes from? 
[14:38:51] (03PS1) 10Volans: Release v0.6.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831892 [14:38:52] which host was that? [14:38:53] ok to deploy that, but probably better disabling paging [14:39:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: analytics-reportupdater-logs-rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:34] db2132 db2133 db2134 db2135 [14:39:36] (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831891 (owner: 10Marostegui) [14:40:12] meaning probably a deeper review has to be done (but doesn't have to happen now) [14:40:28] to disable pages on non critical servers [14:40:34] jynus: so what's wrong with those hosts? [14:40:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) cr1-codfw and cr2-codfw sucessfully upgraded today. Took a while with the firmware upgrades too, I've added some notes [[https://wikitech.wikimedia.o... [14:41:04] marostegui: as far as I understand, they page but have no user traffic (they are misc) [14:41:18] it is ok for them to alert, but paging may be too much [14:41:33] jynus: ah ok, yeah. 
I don't think they should even send IRC notifications, icinga should be enough [14:41:36] I will check them tomorrow [14:41:55] yeah, just noticing that, no rush now [14:42:27] I can just do profile::monitoring::is_critical: false for them [14:42:32] I will check tomorrow [14:42:32] (03CR) 10BCornwall: [C: 03+2] prometheus: Remove quotes from ATS config gauge [puppet] - 10https://gerrit.wikimedia.org/r/831624 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:42:38] yeah, other day with more time :-) [14:42:40] jouncebot now [14:42:40] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [14:42:53] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:43:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:28] !log installing libxslt security updates on buster [14:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:00] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190) [14:46:02] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [14:46:33] !log restarting FPM/Apache on mediawiki canaries [14:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/831895 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [14:47:01] !log dancy@deploy1002 prep aborted: (duration: 00m 12s) [14:47:01] !log dancy@deploy1002 deploy-promote aborted: (duration: 01m 03s) [14:47:11] moritzm: Lemme know when you're done please. [14:47:21] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) @Volans still seems to have a issue ` Traceback (most recent call last): File "/usr/lib/python3/dist-packages/wmflib/config.py", line 3... [14:49:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:50:05] (03PS5) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [14:50:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:50:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:50:20] (03PS3) 10Jbond: P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868 [14:50:26] (03PS6) 10Jbond: P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 [14:51:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:54:04] dancy: you can proceed, I'll do the rest when the deployments are complete [14:54:18] Thanks! 
My part should take about 3 minutes [14:54:45] !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.1 refs T314190 [14:54:48] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [14:55:51] (03PS2) 10KartikMistry: Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300) [14:56:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34622 and previous config saved to /var/cache/conftool/dbconfig/20220913-145631-ladsgroup.json [14:56:35] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:59:29] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1 refs T314190 (duration: 04m 43s) [14:59:40] moritzm: Back atcha [15:00:50] dancy: cheers, I'll resume [15:01:17] (03CR) 10BCornwall: varnish/tests: Remove extraneous test checks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [15:02:05] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.6.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831892 (owner: 10Volans) [15:08:36] !log dancy@deploy1002 deploy-promote aborted: (duration: 00m 02s) [15:09:20] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190) [15:09:22] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [15:10:10] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831903 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [15:10:23] !log dancy@deploy1002 Started scap: testwikis wikis to 
1.39.0-wmf.28 refs T314190 [15:10:26] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [15:11:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P34623 and previous config saved to /var/cache/conftool/dbconfig/20220913-151138-ladsgroup.json [15:12:11] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.0 - volans@cumin1001 [15:13:20] (03PS1) 10Btullis: Failover hive to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/831906 (https://phabricator.wikimedia.org/T311807) [15:13:51] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.0 - volans@cumin1001 [15:14:23] (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) (owner: 10Jdlrobson) [15:14:55] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.28 refs T314190 (duration: 04m 31s) [15:16:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:17:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:17:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:17:34] (03CR) 10Jbond: [C: 03+2] P:spicerack: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/831868 (owner: 10Jbond) [15:17:38] (03CR) 10Jbond: [C: 03+2] P:spicerack: add firmware directory [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [15:17:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitiz [15:17:49] immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:50] (03CR) 10Jbond: [C: 03+2] P:spicerack: add firmware directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831869 (owner: 10Jbond) [15:18:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:18:56] (03CR) 10Btullis: [C: 03+2] Failover hive to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/831906 (https://phabricator.wikimedia.org/T311807) (owner: 10Btullis) [15:23:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:25:53] (03PS1) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 [15:26:07] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:26:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P34624 and previous config saved to /var/cache/conftool/dbconfig/20220913-152644-ladsgroup.json [15:30:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:30:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:31:38] (03CR) 10CI reject: [V: 04-1] wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff) [15:34:10] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: heavily sample k8s proxy/httpd 
logs [puppet] - 10https://gerrit.wikimedia.org/r/831626 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [15:36:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:38:25] (03PS9) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [15:40:32] (03PS10) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [15:40:37] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@031604d]: Automatically drop hitsorical partitions of subgraph analysis [15:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P34625 and previous config saved to /var/cache/conftool/dbconfig/20220913-154151-ladsgroup.json [15:41:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:41:54] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:41:58] (03CR) 10BCornwall: varnish/tests: Remove extraneous test checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [15:42:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:42:45] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@031604d]: Automatically drop hitsorical partitions of subgraph analysis (duration: 02m 07s) [15:42:58] (03PS1) 10Hashar: gerrit: disable automatic plugin handling [puppet] - 10https://gerrit.wikimedia.org/r/831913 (https://phabricator.wikimedia.org/T317412) [15:47:09] PROBLEM - Host db1189 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:47:17] here [15:47:27] a crash maybe? [15:47:27] woot [15:47:31] maybe [15:47:38] marostegui: you depool? 
[15:47:49] yes [15:48:05] checking impact meanwhile [15:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P34626 and previous config saved to /var/cache/conftool/dbconfig/20220913-154810-root.json [15:48:20] letting the two of you run things but I'm here to help if needed :) [15:48:34] "Wikimedia\Rdbms\LoadMonitor::computeServerStates: host db1189 is unreachable" but that is to be expected [15:48:56] it should stop receiving connections and just error out on the retries [15:49:09] RECOVERY - Host db1189 #page is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [15:49:16] it got rebooted [15:50:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: down [15:50:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: down [15:50:34] 10.4, so shouldn't be an issue [15:50:43] Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM. [15:50:46] Memory [15:50:50] I will create a task [15:51:01] (03PS1) 10Hashar: gerrit: scap checks script to automatize deployment [puppet] - 10https://gerrit.wikimedia.org/r/831916 (https://phabricator.wikimedia.org/T317412) [15:51:18] not the master candidate, so we should be ok without it [15:51:19] * jbond also here if needed [15:51:47] no more errors now [15:52:07] (03CR) 10CI reject: [V: 04-1] gerrit: scap checks script to automatize deployment [puppet] - 10https://gerrit.wikimedia.org/r/831916 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [15:52:14] 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) [15:52:39] 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) p:05Triage→03Medium [15:53:58] marostegui, jynus: thanks! 
[15:54:23] 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Started mysql for now. Will do a data check but will leave the host depooled. @Cmjohnson @Jclark-ctr once the DIMM is received and ready to be replaced, please let us know so we can power off the host for you. [15:55:03] won't create an outage report, as even if we did nothing, there would be almost no user impact, just the summary on the handover doc [15:55:17] resolved in VO [15:55:31] only on-the-fly (read only) queries get affected + monitoring spam [15:56:17] handling it is also very well documented: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica [15:56:48] (so mw doesn't keep retrying connecting and alerting) [15:56:49] 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:57:53] 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10wiki_willy) @Cmjohnson - just a heads up, this was just recently installed, so it's under warranty for submitting a RMA with Dell. Thanks, Willy [15:59:40] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 [15:59:42] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 [16:00:05] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:04:42] (03PS1) 10Marostegui: db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831922 (https://phabricator.wikimedia.org/T317662) [16:05:21] (03CR) 10Marostegui: [C: 03+2] db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/831922 (https://phabricator.wikimedia.org/T317662) (owner: 10Marostegui) [16:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T314041)', diff saved to https://phabricator.wikimedia.org/P34628 and previous config saved to /var/cache/conftool/dbconfig/20220913-160536-ladsgroup.json [16:05:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:07:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance [16:07:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Maintenance [16:07:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet with reason: Maintenance [16:07:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet with reason: Maintenance [16:09:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:09:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:11:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:09] !log add 200G to prometheus/eqiad instance ops [16:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:43] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P34629 and previous config saved to /var/cache/conftool/dbconfig/20220913-162043-ladsgroup.json [16:31:04] (03PS2) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 [16:33:44] (03CR) 10Elukey: [C: 03+2] admin_ng: set more values for Istio DR in ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/831876 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [16:34:43] (03CR) 10Cwhite: [C: 03+1] role::kafka::logging: move kafka on all codfw nodes to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [16:34:45] (03CR) 10CI reject: [V: 04-1] wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff) [16:35:02] (03PS11) 10BCornwall: varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 [16:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P34630 and previous config saved to /var/cache/conftool/dbconfig/20220913-163549-ladsgroup.json [16:35:55] (03PS3) 10Muehlenhoff: wcqs/wdqs: New cookbook to perform rolling restart of Nginx [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 [16:36:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:36:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:36:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:36:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:37:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:37:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:39:11] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34631 and previous config saved to /var/cache/conftool/dbconfig/20220913-164734-ladsgroup.json [16:47:38] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [16:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T314041)', diff saved to https://phabricator.wikimedia.org/P34632 and previous config saved to /var/cache/conftool/dbconfig/20220913-165056-ladsgroup.json [16:50:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:51:00] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:51:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [16:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34633 and previous config saved to /var/cache/conftool/dbconfig/20220913-165117-ladsgroup.json [16:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34634 and previous config saved to /var/cache/conftool/dbconfig/20220913-165202-ladsgroup.json [17:02:41] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34635 and previous config saved to /var/cache/conftool/dbconfig/20220913-170241-ladsgroup.json [17:07:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P34636 and previous config saved to /var/cache/conftool/dbconfig/20220913-170708-ladsgroup.json [17:14:45] (03PS1) 10Hashar: gerrit: ignore lint error in role [puppet] - 10https://gerrit.wikimedia.org/r/831932 [17:14:47] (03PS1) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 [17:15:42] (03CR) 10CI reject: [V: 04-1] gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [17:16:48] (03CR) 10Herron: "Masking (bulk?) mail originating from a 3rd party system as our own has risks. High volume or problematic content could cause deliverabil" [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [17:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P34637 and previous config saved to /var/cache/conftool/dbconfig/20220913-171747-ladsgroup.json [17:19:55] (03PS2) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 [17:22:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P34638 and previous config saved to /var/cache/conftool/dbconfig/20220913-172215-ladsgroup.json [17:22:28] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [17:24:56] (03PS4) 10Ryan Kemper: wcqs/wdqs: New rolling restart nginx cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 
10Muehlenhoff) [17:26:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:10] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/pcc-worker1003/1436/" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [17:29:16] (03PS1) 10Volans: wmf-netbox plugin: fix pynetbox issues [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936 [17:32:46] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936 (owner: 10Volans) [17:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T312863)', diff saved to https://phabricator.wikimedia.org/P34639 and previous config saved to /var/cache/conftool/dbconfig/20220913-173254-ladsgroup.json [17:32:58] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [17:36:59] (03CR) 10Volans: [V: 03+2 C: 03+2] wmf-netbox plugin: fix pynetbox issues [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831936 (owner: 10Volans) [17:37:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T314041)', diff saved to https://phabricator.wikimedia.org/P34640 and previous config saved to /var/cache/conftool/dbconfig/20220913-173721-ladsgroup.json [17:37:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:37:25] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:37:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:40:21] (03CR) 10BCornwall: admin: Add Hannah Okwelum to analytics-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [17:40:55] (03PS1) 10Volans: Deploy fix for the wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831937 [17:41:42] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Upgrade wmf-netbox plugin - volans@cumin1001 [17:43:19] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Upgrade wmf-netbox plugin - volans@cumin1001 [17:44:24] (03Abandoned) 10Volans: Deploy fix for the wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/831937 (owner: 10Volans) [17:46:00] (03PS3) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) [17:46:07] (03PS1) 10Ryan Kemper: elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) [17:47:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34642 and previous config saved to /var/cache/conftool/dbconfig/20220913-174718-ladsgroup.json [17:47:22] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:47:23] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:51:37] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @Milimetric Thanks for the reply and for expanding the docs. I think that Wikitech is a more appropriate place for documentation of the group th... [17:51:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 125 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:52:19] (03CR) 10Dzahn: [C: 03+1] admin: Add Hannah Okwelum to analytics-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [17:52:38] (03CR) 10BCornwall: [C: 03+2] admin: Add Hannah Okwelum to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [17:52:44] (03PS2) 10BCornwall: admin: Add Hannah Okwelum to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/831622 (https://phabricator.wikimedia.org/T317545) [17:53:54] The mediawiki errors are https://phabricator.wikimedia.org/T317606 [17:54:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 24 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:56:14] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) @Jclark-ctr just to avoid misunderstanding, did you run it with sudo? 
` sudo secure-cookbook -d sre.dns.netbox "noop" ` [17:57:39] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:58:53] (03PS1) 10MusikAnimal: InitialiseSettings-labs.php: Set $wgPhonosPath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417) [18:00:04] dancy and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T1800). [18:00:30] The train is blocked on https://phabricator.wikimedia.org/T317606 [18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P34643 and previous config saved to /var/cache/conftool/dbconfig/20220913-180225-ladsgroup.json [18:02:57] (03PS1) 10Cwhite: bugfixes [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 [18:02:59] (03PS1) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) [18:03:06] (03CR) 10CI reject: [V: 04-1] bugfixes [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite) [18:03:08] (03CR) 10CI reject: [V: 04-1] add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [18:05:05] (03PS1) 10Dduvall: scap: Remove use of --preserve-env for sudo'd scripts [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) [18:05:07] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:51] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:06:14] (03PS4) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) [18:06:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:37] (03PS2) 10Cwhite: bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 [18:06:39] (03PS2) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) [18:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:07:56] (03CR) 10Cwhite: [C: 03+2] bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite) [18:08:29] (03Merged) 10jenkins-bot: bugfix: pin markupsafe to compatible version 2.0.1 [software/ecs] - 10https://gerrit.wikimedia.org/r/831942 (owner: 10Cwhite) [18:08:31] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) That was without sudo. 
With Sudo still requires password [18:10:27] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:11:55] (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 (owner: 10Jbond) [18:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:12:53] (03CR) 10Volans: [C: 03+1] "LGTM, it just need a manual clean/move of the existing files once deployed" [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 (owner: 10Jbond) [18:13:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:15:23] 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, and 3 others: Mediawiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10aaron) [18:17:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:32] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P34644 and previous config saved to /var/cache/conftool/dbconfig/20220913-181731-ladsgroup.json [18:17:44] (03PS5) 10Aishik Rehman: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) [18:19:26] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) Ahhh I think I know what happened here, it's the dry-run option. Try to run it for real: ` sudo secure-cookbook sre.dns.netbox "noop" ` @jbon... [18:19:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:21:57] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:24:40] Hi, I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/831941/, which only touches the beta cluster (`wmf-config/InitialiseSettings-labs.php`) - any pressing reasons why I shouldn't? I believe the current window for the train isn't happening because it's blocked [18:25:04] Go for it. 
[18:25:10] :) [18:25:44] (03CR) 10Samtar: [C: 03+2] "Beta cluster deploy, no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [18:26:28] (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Set $wgPhonosPath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831941 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [18:27:11] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) >>! In T317545#8233696, @BCornwall wrote: > I think that Wikitech is a more appropriate place for documentation of the group than the codebase:... [18:27:49] (03PS1) 10Ssingh: P:wikidough: remove redundant resource absentees [puppet] - 10https://gerrit.wikimedia.org/r/831946 [18:28:35] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37247/console" [puppet] - 10https://gerrit.wikimedia.org/r/831946 (owner: 10Ssingh) [18:28:49] !log deploying a beta cluster only config change, T317417 [18:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:53] T317417: Phonos links to unroutable domain/URL for the MP3 file - https://phabricator.wikimedia.org/T317417 [18:29:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:29:57] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) [18:31:35] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:31:48] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:831941|InitialiseSettings-labs.php: Set $wgPhonosPath (T317417)]] (duration: 03m 45s) [18:32:08] (done, thanks) [18:32:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34645 and previous config saved to /var/cache/conftool/dbconfig/20220913-183238-ladsgroup.json [18:32:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:32:45] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:32:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:33:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34646 and previous config saved to /var/cache/conftool/dbconfig/20220913-183259-ladsgroup.json [18:33:25] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: remove redundant resource absentees [puppet] - 10https://gerrit.wikimedia.org/r/831946 (owner: 10Ssingh) [18:33:55] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If that's true, would you be kind enough to prov... 
[18:36:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:36:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:36:42] (03PS1) 10Cwhite: logstash: expand ecs pre and post filter gates [puppet] - 10https://gerrit.wikimedia.org/r/831949 (https://phabricator.wikimedia.org/T292585) [18:38:09] (03PS1) 10Volans: cli: add --version option [software/homer] - 10https://gerrit.wikimedia.org/r/831951 [18:38:28] (03PS2) 10Ryan Kemper: elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) [18:38:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall Hi! Thanks for the request. Could I get Hasan's shell username? I'm unable to find that information. [18:39:27] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37248/console" [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper) [18:40:41] (03PS3) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 [18:41:01] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [18:41:21] (03PS1) 10Cwhite: logstash: migrate mediawiki_ecs to ecs 1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/831952 (https://phabricator.wikimedia.org/T314098) [18:41:28] (03CR) 10Hashar: "Patchset 3 moves the Apache templates from the Gerrit module to the profile." 
[puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [18:42:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:43:07] (03CR) 10Bking: [C: 03+1] elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper) [18:43:12] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: upgrade eqiad elasticsearch to 7.10.2 [puppet] - 10https://gerrit.wikimedia.org/r/831938 (https://phabricator.wikimedia.org/T317686) (owner: 10Ryan Kemper) [18:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:46:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall Hi, Tanuja! I'll need approval from your manager before proceeding. Could you tag them here, please? 
[18:46:09] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:46:55] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [18:46:58] T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686 [18:47:50] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [18:50:31] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10BCornwall) p:05Triage→03Medium a:03BCornwall [18:52:56] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) p:05Triage→03Medium a:03ayounsi @ayounsi as you are a bureaucrat on wikitech, I choose you for the privilege of renaming! (Thanks for doing that if you can!) [18:54:23] (03PS1) 10MusikAnimal: rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) [18:55:00] (03CR) 10CI reject: [V: 04-1] rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [19:00:26] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10taavi) @ayounsi @bcornwall Please don't if you don't fully understand the effects of an account rename on all the systems that use developer account/LDAP authentication. We don't usually... 
[19:00:41] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:00:45] (03PS2) 10MusikAnimal: rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) [19:01:53] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [19:01:57] T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686 [19:08:03] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BTullis) >>! In T317545#8233883, @BCornwall wrote: > @ottomata or @elukey I'm under the impression that one of you would be the best person to handle the Kerberos access. If tha... [19:16:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34647 and previous config saved to /var/cache/conftool/dbconfig/20220913-191632-ladsgroup.json [19:16:37] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:17:36] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) a:05ayounsi→03None [19:17:40] (03CR) 10Ryan Kemper: [C: 03+1] "I think a batch size of 1 with a 1 second delay is probably fine, given how fast nginx comes back up." [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff) [19:17:50] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10BCornwall) 05Open→03Invalid Thanks for the information, @taavi. 
I missed the banner since the given link was an anchor, TBH. Given this, I think it's safe to close this as invalid. [19:18:07] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:09] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-omega-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:43] PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-eqiad.service,elasticsearch_7@production-search-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:11] ^^ ryankemper I'm looking at 1080 now [19:19:34] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [19:19:38] T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686 [19:20:34] inflatador: we ran puppet across the whole fleet once, perhaps we should have ran it twice [19:21:42] ryankemper ACK, continuing convo in search [19:28:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall) [19:28:17] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T317621 (10BCornwall) [19:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P34648 and previous config saved to /var/cache/conftool/dbconfig/20220913-193139-ladsgroup.json [19:32:46] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10WMDE-leszek) @taavi I am not sure I have understood the reasoning fully. The request is about removing the WMDE suffix that Hasan has accidentally included? Or is your suggestion to not r... [19:34:30] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10Dzahn) Creating a new account will be MUCH easier than renaming. And if has just recently been created and therefore not much history then it especially makes the most sense to simply cre... [19:35:44] (03CR) 10JHathaway: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [19:45:51] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P34649 and previous config saved to /var/cache/conftool/dbconfig/20220913-194645-ladsgroup.json [19:47:09] 10SRE, 10LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10WMDE-leszek) alright thanks @Dzahn. I somehow managed to miss the topmost banner as well. 
[19:51:07] RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:29] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [19:55:33] T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686 [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220913T2000). [20:00:05] Aishik: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:29] o/ [20:00:32] i can deploy [20:00:37] woo [20:00:40] ^^ [20:00:42] lol [20:01:11] 🙂 TheresNoTime [20:01:36] Your 'emoji' [20:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T314041)', diff saved to https://phabricator.wikimedia.org/P34650 and previous config saved to /var/cache/conftool/dbconfig/20220913-200152-ladsgroup.json [20:01:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:01:56] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:01:59] hi Aishik: getting started with your patch [20:02:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34651 and previous config saved to 
/var/cache/conftool/dbconfig/20220913-200214-ladsgroup.json [20:02:24] (03PS6) 10Clare Ming: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:03:04] (03CR) 10CI reject: [V: 04-1] add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:03:14] (03PS1) 10Gmodena: charts:eventstreams bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/831957 (https://phabricator.wikimedia.org/T292390) [20:03:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [20:03:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [20:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34652 and previous config saved to /var/cache/conftool/dbconfig/20220913-200344-ladsgroup.json [20:03:57] Aishik: CI needs a newline -- can you take care of that? 
otherwise i can push up a quick fix [20:04:21] Do it please [20:04:42] np [20:05:52] (: [20:06:18] (03PS7) 10Clare Ming: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:06:57] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:22] (03CR) 10Clare Ming: [C: 03+2] add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:07:37] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:09] (03Merged) 10jenkins-bot: add tagline and update wordmark in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:08:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831223 (https://phabricator.wikimedia.org/T313174) (owner: 10Aishik Rehman) [20:09:03] !log cjming@deploy1002 Started scap: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]] [20:09:06] T313174: add tagline and wordmark in ptwikinews - https://phabricator.wikimedia.org/T313174 [20:09:24] !log cjming@deploy1002 cjming and aishik: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:09:26] Aishik: can you verify on mwdebug? [20:10:24] tagline is working! [20:10:30] yay - going live [20:10:40] but not the wordmark! 
[20:11:27] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:34] hmm - whoops - it might need to be purged - i already started the sync [20:13:45] 😴 [20:13:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:26] TheresNoTime: i forget - do run the purgeList script on the deployment server? [20:14:53] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:831223|add tagline and update wordmark in ptwikinews (T313174)]] (duration: 05m 50s) [20:14:56] T313174: add tagline and wordmark in ptwikinews - https://phabricator.wikimedia.org/T313174 [20:15:02] so something like: "echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikinews-tagline-pt.svg' | mwscript purgeList.php"? [20:15:22] cjming: nae, https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging - on `mwmaint` :) [20:15:34] ah - thanks [20:16:44] Aishik: just purged that svg - can you check on prod? [20:17:30] Wordmark is not working yet... [20:17:46] gah - purged the wrong svg - 1 sec [20:17:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:04] (03PS3) 10Cwhite: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) [20:18:26] Aishik: how about now? [20:18:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:13] Nope! [20:21:40] ... [20:21:57] The wordmark is the thing you see in the left up corner with the new vector skin, isn't it? [20:22:02] hmm - not sure what to say about that - maybe it takes a minute? [20:22:39] @zabe yeap! [20:22:54] vector 2022 skin [20:22:56] Aishik, could you try clearing you browser cache? 
[20:23:29] It's working! [20:23:34] yay! [20:23:43] thanks zabe - it's always cache [20:23:51] Thank you (: [20:24:21] alrighty closing the backport window seeing there's nothing else in the queue [20:25:00] yw [20:25:04] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) ` jclark@cumin1001:~$ sudo secure-cookbook sre.dns.netbox "noop" We trust you have received the usual lecture from the local System Admin... [20:25:41] !log end of UTC late backport window [20:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:01] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, thanos-fe1002, thanos-fe1003, t [20:27:01] 2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:29:29] (03PS1) 10Hashar: gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 [20:30:23] (03CR) 10CI reject: [V: 04-1] gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [20:34:49] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, thanos-fe1002, thanos-fe1003, t [20:34:49] 2001, 
thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:48:31] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:01:08] (03PS1) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) [21:01:21] (03PS2) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) [21:01:22] jouncebot nowandnext [21:01:22] No deployments scheduled for the next 9 hour(s) and 58 minute(s) [21:01:22] In 9 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T0700) [21:03:46] (03PS3) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) [21:04:22] (03CR) 10CI reject: [V: 04-1] phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [21:04:32] !log dancy@deploy1002 Started scap: testing T299648 [21:04:36] T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 [21:06:16] (03PS1) 10Volans: admin: fix sudo permission for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/831987 (https://phabricator.wikimedia.org/T306654) [21:07:11] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Patch-For-Review: Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Volans) @Jclark-ctr whoops, that's not wha't supposed to happen. On second review I think that the original patch has an error,... 
[21:08:37] (03PS4) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) [21:10:33] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:12:54] (03PS5) 10Dduvall: phabricator: Fix sudo env_keep format [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) [21:13:52] (03CR) 10Brennen Bearnes: [C: 03+1] "Makes sense to me. Merge at will I'd say." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:14:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:14:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:32] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:51] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:54] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:10] !log dancy@deploy1002 dancy: testing T299648 synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:15:13] T299648: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 [21:15:44] (03CR) 10Dduvall: [V: 03+1] "Sorry for the noise. Manually verified in devtools." [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [21:16:05] !log dancy@deploy1002 Sync cancelled. [21:16:42] !log dancy@deploy1002 touch /var/lib/deploy-mwdebug/pause [21:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:09] (03CR) 10Dduvall: [C: 03+2] "Thanks! 
Verified in devtools. This won't work until I09cb4161712257f27999bc322a1bd80206afe82a is merged but the deployment doesn't work cu" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [21:18:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:36:51] !log dancy@deploy1002 Started scap: testing [21:37:12] !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ MW_CONFIG_BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restric [21:37:12] ted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. (duration: 00m 20s) [21:39:13] (03CR) 10Herron: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [21:47:36] !log dancy@deploy1002 Started scap: testing [21:48:18] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:50:37] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:50:53] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:54:47] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:54:54] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:55:08] !log dancy@deploy1002 Sync cancelled. 
[21:55:17] !log dancy@deploy1002 Started scap: testing [21:55:58] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:56:52] (03CR) 10JHathaway: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [21:58:39] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:58:55] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:01:03] (03PS1) 10RLazarus: httpbb: In PHP version routing tests, allow either 7.2 or 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/831997 [22:01:16] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:01:32] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:01:35] !log dancy@deploy1002 Sync cancelled. [22:02:36] !log dancy@deploy1002 Started scap: testing [22:03:15] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:05:59] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:06:07] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:06:33] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:06:45] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:07:02] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:07:07] !log dancy@deploy1002 Sync cancelled. 
[22:07:46] !log dancy@deploy1002 Started scap: testing [22:08:26] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:10:49] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:11:05] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:11:50] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:11:59] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [22:12:07] !log dancy@deploy1002 Sync cancelled. [22:12:45] !log dancy@deploy1002 Started scap: testing [22:13:00] Sorry for the noise. I think this will be the last run for the day. [22:13:25] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:13:35] (03CR) 10Dzahn: "wow, great idea to use the validate_command with file. thanks for that. will get to it soon! currently a bit afk" [puppet] - 10https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [22:14:05] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:14:12] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:14:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:14:54] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:15:01] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:15:04] !log dancy@deploy1002 Sync cancelled. 
[22:15:09] (PS2) Arlolra: Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318)
[22:16:06] !log dancy@deploy1002$ rm /var/lib/deploy-mwdebug/pause
[22:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34653 and previous config saved to /var/cache/conftool/dbconfig/20220913-221734-ladsgroup.json
[22:17:39] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:19:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:19:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:19:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:19:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:27:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34654 and previous config saved to /var/cache/conftool/dbconfig/20220913-222738-ladsgroup.json
[22:27:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:30:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34655 and previous config saved to /var/cache/conftool/dbconfig/20220913-223025-ladsgroup.json
[22:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P34656 and previous config saved to /var/cache/conftool/dbconfig/20220913-223241-ladsgroup.json
[22:32:50] (PS15) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[22:34:19] (CR) Raymond Ndibe: "Hello David, the new tests you added are failing because of the sudo command we are using. I'm currently looking for a way to fix this" [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[22:36:44] (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[22:39:39] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[22:40:11] (CR) Dzahn: [C: +2] "compiled and confirmed with manual visudo on phab2001, disabled puppet on other hosts" [puppet] - https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[22:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P34657 and previous config saved to /var/cache/conftool/dbconfig/20220913-224244-ladsgroup.json
[22:43:26] (PS1) Andrew Bogott: toolviews.py: run through black in advance of some changes [puppet] - https://gerrit.wikimedia.org/r/832000
[22:43:28] (PS1) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[22:43:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:44:03] (CR) CI reject: [V: -1] toolviews.py: run through black in advance of some changes [puppet] - https://gerrit.wikimedia.org/r/832000 (owner: Andrew Bogott)
[22:45:27] (CR) Dzahn: [C: +2] "root@phab2001:/etc/sudoers.d# su phab-deploy" [puppet] - https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[22:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P34658 and previous config saved to /var/cache/conftool/dbconfig/20220913-224749-ladsgroup.json
[22:49:24] (PS2) Andrew Bogott: toolviews.py: run through black in advance of some changes [puppet] - https://gerrit.wikimedia.org/r/832000
[22:49:26] (PS2) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[22:57:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P34659 and previous config saved to /var/cache/conftool/dbconfig/20220913-225750-ladsgroup.json
[23:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34660 and previous config saved to /var/cache/conftool/dbconfig/20220913-230026-ladsgroup.json
[23:00:30] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:01:39] (CR) Dzahn: [C: +2] "@dduval it works. I tested it like this:" [puppet] - https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:02:49] (CR) Dduvall: [V: +1] phabricator: Fix sudo env_keep format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:02:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T314041)', diff saved to https://phabricator.wikimedia.org/P34661 and previous config saved to /var/cache/conftool/dbconfig/20220913-230255-ladsgroup.json
[23:02:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[23:03:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[23:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T314041)', diff saved to https://phabricator.wikimedia.org/P34662 and previous config saved to /var/cache/conftool/dbconfig/20220913-230317-ladsgroup.json
[23:06:09] (CR) Andrew Bogott: [C: +2] toolviews.py: run through black in advance of some changes [puppet] - https://gerrit.wikimedia.org/r/832000 (owner: Andrew Bogott)
[23:10:23] (CR) Dzahn: [C: +2] "deployed on all phab servers now" [puppet] - https://gerrit.wikimedia.org/r/831965 (https://phabricator.wikimedia.org/T313259) (owner: Dduvall)
[23:12:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T314041)', diff saved to https://phabricator.wikimedia.org/P34663 and previous config saved to /var/cache/conftool/dbconfig/20220913-231257-ladsgroup.json
[23:12:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:13:03] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:13:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34664 and previous config saved to /var/cache/conftool/dbconfig/20220913-231533-ladsgroup.json
[23:19:17] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:30:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34665 and previous config saved to /var/cache/conftool/dbconfig/20220913-233039-ladsgroup.json
[23:45:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T314041)', diff saved to https://phabricator.wikimedia.org/P34666 and previous config saved to /var/cache/conftool/dbconfig/20220913-234546-ladsgroup.json
[23:45:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[23:45:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:46:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[23:46:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34667 and previous config saved to /var/cache/conftool/dbconfig/20220913-234607-ladsgroup.json
[23:47:35] (PS3) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)