[00:00:04] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [00:05:20] RECOVERY - Disk space on ml-staging-ctrl2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [00:07:00] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: man-db.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:00] RECOVERY - Disk space on ml-staging-ctrl2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:44] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:01:59] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [02:24:37] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:39:03] 10SRE, 10Wikimedia-Apache-configuration, 10Developer Productivity, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - https://phabricator.wikimedia.org/T190111 (10Krinkle) 05Open→03Resolved a:03Krinkle The current approach I docum... [02:40:13] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:44:05] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [02:46:21] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - https://phabricator.wikimedia.org/T116132 (10Krinkle) I think the bandwidth congestion risk is sufficient to not go this route. It seems likely to result in performan... [03:25:41] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:00:45] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [04:09:43] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:13] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:27:29] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:14:29] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:55:41] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 10.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:20:19] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:24:39] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:33:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 9.564 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [10:40:29] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [11:46:31] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:48:43] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:59:55] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [13:15:31] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:38:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:40:35] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 10 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:16:43] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:23:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:31] (03PS1) 10Zabe: Migrate $wmfConfigDir to $wmgConfigDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) [16:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [16:13:34] (03PS4) 10Majavah: P:wmcs::paws::prometheus: add kubernetes prometheus jobs [puppet] - 10https://gerrit.wikimedia.org/r/778622 (https://phabricator.wikimedia.org/T304716) [16:18:02] (03PS5) 10Majavah: P:wmcs::paws::prometheus: add kubernetes prometheus jobs [puppet] - 10https://gerrit.wikimedia.org/r/778622 (https://phabricator.wikimedia.org/T304716) [16:22:32] (03PS1) 10Majavah: P:toolforge::prometheus: remove paws jobs [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) [16:31:46] (03PS1) 10Majavah: hieradata: make grafana-cloud the preferred hostname [puppet] - 10https://gerrit.wikimedia.org/r/778674 [16:32:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34762/console" [puppet] - 10https://gerrit.wikimedia.org/r/778674 (owner: 10Majavah) [16:53:03] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 15.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:59:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:37:17] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 16.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:44:01] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:02:21] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 189 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:37] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:21:41] PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1287.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:21:49] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1299.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:21:53] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1298.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:22:51] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:23:15] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1380.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:27:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:38:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:38:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24339 and previous config saved to /var/cache/conftool/dbconfig/20220410-193900-ladsgroup.json [19:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:45:19] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [20:24:01] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:35:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24340 and previous config saved to /var/cache/conftool/dbconfig/20220410-203508-ladsgroup.json [20:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:37:35] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24341 and previous config saved to /var/cache/conftool/dbconfig/20220410-205014-ladsgroup.json [20:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:23] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24342 and previous config saved to /var/cache/conftool/dbconfig/20220410-210519-ladsgroup.json [21:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:01] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:03] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24343 and previous config saved to /var/cache/conftool/dbconfig/20220410-212024-ladsgroup.json [21:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [21:20:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:20:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [21:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24344 and previous config saved to /var/cache/conftool/dbconfig/20220410-212042-ladsgroup.json [21:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:05] RECOVERY - MariaDB Replica Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:36:24] 10SRE, 10ConfirmEdit (CAPTCHA extension), 10Platform Engineering, 10Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (10Zabe) a:03Zabe [21:41:05] RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:42:21] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:47:45] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:53:31] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:11] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 48.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24345 and previous config saved to /var/cache/conftool/dbconfig/20220410-222548-ladsgroup.json [22:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:40:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24346 and previous config saved to /var/cache/conftool/dbconfig/20220410-224053-ladsgroup.json [22:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24347 and previous config saved to /var/cache/conftool/dbconfig/20220410-225559-ladsgroup.json [22:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24348 and previous config saved to /var/cache/conftool/dbconfig/20220410-231104-ladsgroup.json [23:11:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [23:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [23:11:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24349 and previous config saved to /var/cache/conftool/dbconfig/20220410-231112-ladsgroup.json [23:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log