[00:00:38] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:06:02] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 4 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [00:07:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:08:22] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [00:20:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28374 and previous config saved to /var/cache/conftool/dbconfig/20220524-002006-ladsgroup.json [00:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:13] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [00:22:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28375 and previous config saved to /var/cache/conftool/dbconfig/20220524-002246-ladsgroup.json [00:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:53] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [00:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28376 and previous config saved to /var/cache/conftool/dbconfig/20220524-003511-ladsgroup.json [00:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28377 and previous config saved to /var/cache/conftool/dbconfig/20220524-003752-ladsgroup.json [00:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:10] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 170, active_shards: 300, relocating_shards: 0, initializing_shards: 2, 
unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma [00:43:10] g_in_queue_millis: 0, active_shards_percent_as_number: 97.71986970684038 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:43:26] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, active_shards_percent_as_number: 99.6742671009772, active_shards: 306, timed_out: False, delayed_unassigned_shards: 0, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 2, initializing_shards: 1, active_primary_shards: 170, relocating_shards: 0, status: yellow, nu [00:43:27] in_flight_fetch: 0, number_of_pending_tasks: 0, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:45:10] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:50:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28378 and previous config saved to /var/cache/conftool/dbconfig/20220524-005016-ladsgroup.json [00:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28379 and previous config saved to /var/cache/conftool/dbconfig/20220524-005257-ladsgroup.json [00:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:54] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see 
Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0100) [01:05:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28380 and previous config saved to /var/cache/conftool/dbconfig/20220524-010521-ladsgroup.json [01:05:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [01:05:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [01:05:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [01:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:27] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [01:05:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [01:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28381 and previous config saved to /var/cache/conftool/dbconfig/20220524-010534-ladsgroup.json [01:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298555)', diff 
saved to https://phabricator.wikimedia.org/P28382 and previous config saved to /var/cache/conftool/dbconfig/20220524-010622-ladsgroup.json [01:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1004.eqiad.wmnet with OS bullseye [01:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28383 and previous config saved to /var/cache/conftool/dbconfig/20220524-010802-ladsgroup.json [01:08:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:08] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [01:08:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28384 and previous config saved to /var/cache/conftool/dbconfig/20220524-010810-ladsgroup.json [01:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:47] (PS1) Ryan Kemper: sre.hosts.reimage: update usage w/ req arg [cookbooks] - https://gerrit.wikimedia.org/r/797712 [01:09:30] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:13:42] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 21 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 21, active_shards: 21, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 21, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 [01:13:42] _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [01:14:58] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 167 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 137, active_shards: 137, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 167, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number [01:14:58] light_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 45.06578947368421 https://wikitech.wikimedia.org/wiki/Search%23Administration [01:16:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:48] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1004.eqiad.wmnet with reason: host reimage [01:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:33] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1004.eqiad.wmnet with reason: host reimage [01:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:54] 
PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28385 and previous config saved to /var/cache/conftool/dbconfig/20220524-012127-ladsgroup.json [01:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:00] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 167 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 137, active_shards: 137, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 167, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 [01:26:00] _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 45.06578947368421 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [01:26:00] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 21 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 21, active_shards: 21, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 21, delayed_unassigned_shards: 0, number_of_pending_ [01:26:00] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [01:36:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to 
https://phabricator.wikimedia.org/P28386 and previous config saved to /var/cache/conftool/dbconfig/20220524-013632-ladsgroup.json [01:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:30] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:47] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1004.eqiad.wmnet with OS bullseye [01:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:01] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:50:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298555)', diff saved to https://phabricator.wikimedia.org/P28387 and previous config saved to /var/cache/conftool/dbconfig/20220524-015137-ladsgroup.json [01:51:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:51:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:51:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:44] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [01:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28388 and previous config saved to /var/cache/conftool/dbconfig/20220524-015145-ladsgroup.json [01:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:12] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:06:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:41] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823 [02:07:45] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823 (owner: TrainBranchBot) [02:09:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:09:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:12] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:11:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:12:00] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:31] (CR) Dzahn: [C: +1] gitlab: retry rails console, don't keep gitlab-secrets.json [puppet] - https://gerrit.wikimedia.org/r/797301 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [02:24:35] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.13 [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/797823 (owner: TrainBranchBot) [02:32:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:36:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28389 and previous config saved to /var/cache/conftool/dbconfig/20220524-024333-ladsgroup.json [02:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:39] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [02:44:14] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:47:47] (PS4) Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 
(https://phabricator.wikimedia.org/T303961) [02:50:10] (PS5) Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961) [02:53:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:58:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28390 and previous config saved to /var/cache/conftool/dbconfig/20220524-025838-ladsgroup.json [02:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28391 and previous config saved to /var/cache/conftool/dbconfig/20220524-031343-ladsgroup.json [03:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:32] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:21:20] PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 4.583 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:21:41] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.081 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:23:12] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:34] RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.045 second 
response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:28:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298555)', diff saved to https://phabricator.wikimedia.org/P28392 and previous config saved to /var/cache/conftool/dbconfig/20220524-032848-ladsgroup.json [03:28:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [03:28:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [03:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:55] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [03:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:48] (PS1) KartikMistry: Enable Content and Section Translation in Serbian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304858) [04:07:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:10:26] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 274, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 33, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_ [04:10:26] in_queue_millis: 0, active_shards_percent_as_number: 89.25081433224756 
https://wikitech.wikimedia.org/wiki/Search%23Administration [04:10:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28393 and previous config saved to /var/cache/conftool/dbconfig/20220524-041034-ladsgroup.json [04:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:41] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [04:15:48] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:23:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (netbox1002), No backups: 2 (backup1002, ...), Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28394 and previous config saved to /var/cache/conftool/dbconfig/20220524-042539-ladsgroup.json [04:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:06] (CR) Samwilson: [C: +1] Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: Abijeet Patro) [04:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28395 and previous config saved to /var/cache/conftool/dbconfig/20220524-044044-ladsgroup.json [04:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:40] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:52:24] SRE, Wikibugs: wikibugs has 
stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (Marostegui) @valhallasw if you can update https://www.mediawiki.org/wiki/Wikibugs to make it clearer...I think that'd be the only pending thing before we can close t... [04:53:46] (PS1) Marostegui: db1172: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/798043 [04:54:41] (CR) Marostegui: [C: +2] db1172: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/798043 (owner: Marostegui) [04:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 1%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28396 and previous config saved to /var/cache/conftool/dbconfig/20220524-045508-root.json [04:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298555)', diff saved to https://phabricator.wikimedia.org/P28397 and previous config saved to /var/cache/conftool/dbconfig/20220524-045549-ladsgroup.json [04:55:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [04:55:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [04:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:54] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [04:55:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28398 and previous config saved to /var/cache/conftool/dbconfig/20220524-045602-ladsgroup.json [04:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:15] SRE-OnFire, DBA, Blocked-on-schema-change, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) [05:07:38] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 [05:07:38] _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:10:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28399 and previous config saved to /var/cache/conftool/dbconfig/20220524-051011-root.json [05:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:11] !log Rename revision_actor_temp on s6 T307906 [05:11:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:16] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [05:17:02] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:20:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_ [05:20:40] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:20:40] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 21, active_shards: 26, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_ [05:20:40] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 61.904761904761905 Ryan Kemper https://phabricator.wikimedia.org/T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:25:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28400 and previous config saved to /var/cache/conftool/dbconfig/20220524-052515-root.json [05:25:20] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:38:00] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:40:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28401 and previous config saved to /var/cache/conftool/dbconfig/20220524-054019-root.json [05:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:55:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28402 and previous config saved to /var/cache/conftool/dbconfig/20220524-055523-root.json [05:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0600). 
[06:08:09] !log Rename revision_actor_temp on s8 T307906 [06:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:15] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [06:10:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28403 and previous config saved to /var/cache/conftool/dbconfig/20220524-061027-root.json [06:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:58] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28404 and previous config saved to /var/cache/conftool/dbconfig/20220524-061119-ladsgroup.json [06:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:25] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [06:12:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:12:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28405 and previous config saved to /var/cache/conftool/dbconfig/20220524-061237-ladsgroup.json [06:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [06:15:06] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:17:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:25:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After migrating back to 10.4', diff saved to https://phabricator.wikimedia.org/P28406 and previous config saved to /var/cache/conftool/dbconfig/20220524-062531-root.json [06:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28407 and previous config saved to /var/cache/conftool/dbconfig/20220524-062625-ladsgroup.json [06:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:06] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:39:42] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28408 and previous config saved to /var/cache/conftool/dbconfig/20220524-064130-ladsgroup.json [06:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298560)', diff saved to https://phabricator.wikimedia.org/P28409 and previous config saved to /var/cache/conftool/dbconfig/20220524-065635-ladsgroup.json [06:56:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:56:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:42] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [06:56:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28410 and previous config saved to /var/cache/conftool/dbconfig/20220524-065643-ladsgroup.json [06:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T0700). [07:00:04] mainframe98: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28411 and previous config saved to /var/cache/conftool/dbconfig/20220524-070052-ladsgroup.json [07:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:58] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:01:17] (03PS1) 10Muehlenhoff: Add more contributors [puppet] - 10https://gerrit.wikimedia.org/r/798352 [07:02:29] (03CR) 10Muehlenhoff: [C: 03+2] Add more contributors [puppet] - 10https://gerrit.wikimedia.org/r/798352 (owner: 10Muehlenhoff) [07:05:15] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Looks good. merging." [puppet] - 10https://gerrit.wikimedia.org/r/797366 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:06:36] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:10:33] (03CR) 10Muehlenhoff: "This module only ships an args.erb file which isn't used anywhere in Puppet, I think instead we can simply remove it for good?" 
[puppet] - 10https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:12:02] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:13:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [07:13:28] <_joe_> ok [07:13:31] <_joe_> timeouts [07:13:40] here [07:13:46] uhoh [07:13:47] <_joe_> I'll look at the backends [07:13:49] here too [07:14:42] <_joe_> can someone look at the nel dashboard for patterns? [07:14:50] not seeing any obvious drop in frontend traffic [07:14:53] _joe_: ok I will [07:14:57] * jelto around [07:15:41] looks like a spike now btw [07:15:57] <_joe_> nothing of note on either mediawiki nor the edge [07:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28412 and previous config saved to /var/cache/conftool/dbconfig/20220524-071557-ladsgroup.json [07:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:46] (03CR) 10Zabe: tmpreaper: Add SPDX header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:17:02] https://logstash.wikimedia.org/goto/c08e26277bf98ea92a6a8c33361a6aaa the spike [07:17:43] I'm happy to tweak the alert a little bit too in terms of how sensitive it is [07:18:00] should be recovering soon [07:18:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - 
https://alerts.wikimedia.org/?q=alertname%3DNELHigh [07:18:24] <_joe_> godog: nah I think it's ok [07:18:33] <_joe_> I mean this wasn't a false positive [07:19:11] * volans here [07:19:14] did I miss the fun? [07:19:25] anything I can do? [07:19:54] volans: turns out it was a spike [07:20:53] _joe_: yeah fair enough, I was thinking sth like a higher threshold and/or smaller threshold but 'for' duration a little longer, anyways let's see what happens [07:21:41] ack [07:21:43] while we're on the subject, I'm happy to report that the shower of individual pages for failing services from icinga will be going away soon: https://gerrit.wikimedia.org/r/q/topic:bug%252FT291946-monitoring-and-host-removal [07:22:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:22:06] godog: I owe you a beer [07:22:27] Amir1: awww <3 will gladly accept, thank you! [07:22:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28413 and previous config saved to /var/cache/conftool/dbconfig/20220524-072243-ladsgroup.json [07:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:50] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:23:36] (03CR) 10Volans: [C: 03+2] "Thanks for the patch!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/797712 (owner: 10Ryan Kemper) [07:23:39] (03PS1) 10Majavah: nrpe::plugin: don't require a source with ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/798372 [07:24:47] Amir1: Now that the crisis is over, can we deploy/merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793402? [07:25:17] mainframe98: sure go ahead [07:25:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35507/console" [puppet] - 10https://gerrit.wikimedia.org/r/798372 (owner: 10Majavah) [07:26:03] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10fgiunchedi) 05Open→03Resolved Thank you @papaul! Resolving, we'll be following up in {T309074} [07:27:00] (03Merged) 10jenkins-bot: sre.hosts.reimage: update usage w/ req arg [cookbooks] - 10https://gerrit.wikimedia.org/r/797712 (owner: 10Ryan Kemper) [07:27:40] Amir1: I don't have +2 in operations/mediawiki-config and I'm not a deployer myself; I need someone to +2 the change and do the follow up steps; how do I do that? [07:28:01] mainframe98: you ping me ;) [07:28:57] so the config is removed in code but it's not deployed yet. It looks like they are exactly the same so it's still should be noop [07:29:39] That's right [07:30:09] (03CR) 10Ladsgroup: "Since it's the same as the default values. It's fine to deploy this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [07:30:13] (03PS3) 10Ladsgroup: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [07:30:16] (03CR) 10Ladsgroup: [C: 03+2] Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [07:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28414 and previous config saved to /var/cache/conftool/dbconfig/20220524-073102-ladsgroup.json [07:31:03] (03Merged) 10jenkins-bot: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [07:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:45] pulled to mwdebug and confirm it didn't change [07:33:40] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793402|Remove wgPriorityHints and wgPriorityHintsRatio (T308707)]] (duration: 00m 50s) [07:33:40] \o/ [07:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:45] T308707: Remove inactive code from Priority Hints experiment in MW core - https://phabricator.wikimedia.org/T308707 [07:34:49] (03PS2) 10Majavah: nrpe::plugin: don't require a source with ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/798372 [07:35:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35508/console" [puppet] - 10https://gerrit.wikimedia.org/r/798372 (owner: 10Majavah) [07:35:54] Amir1: Thanks! [07:36:19] mainframe98: Thank you for doing the work I just pressed some shiny buttons [07:36:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:36:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [07:37:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [07:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28415 and previous config saved to /var/cache/conftool/dbconfig/20220524-073738-ladsgroup.json [07:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:43] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:37:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28416 and previous config saved to /var/cache/conftool/dbconfig/20220524-073748-ladsgroup.json [07:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:39:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The code looks ok; however I have one main issue here. Specifically, we're tying the schema of service::catalog to the spicerack code quit" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [07:46:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298555)', diff saved to https://phabricator.wikimedia.org/P28417 and previous config saved to /var/cache/conftool/dbconfig/20220524-074607-ladsgroup.json [07:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:14] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:46:53] (03PS1) 10KartikMistry: Enable Section Translation for Hindi in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798389 (https://phabricator.wikimedia.org/T308834) [07:47:22] (03CR) 10Volans: [C: 03+2] service: add new module to expose service::catalog (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [07:48:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [07:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:31] kubetcd2005 will be going down temporarily [07:49:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Remove unused rewrite_static_assets param [puppet] - 10https://gerrit.wikimedia.org/r/778602 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [07:50:06] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:51:42] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [07:52:17] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:52:21] !log 
oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28418 and previous config saved to /var/cache/conftool/dbconfig/20220524-075253-ladsgroup.json [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28419 and previous config saved to /var/cache/conftool/dbconfig/20220524-075259-ladsgroup.json [07:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:04] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:53:05] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) >>! In T271143#7951004, @bking wrote: > @volans , we are ready to do "brave mode" on the remaining CODF... 
[07:53:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [07:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:50] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [07:56:06] (03Merged) 10jenkins-bot: service: add new module to expose service::catalog [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [07:56:21] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:56:24] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:13] (03PS1) 10Muehlenhoff: Add some additional SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/798393 [07:57:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:49] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:07] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/798394 [08:00:13] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) p:05Triage→03High [08:00:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798393 (owner: 10Muehlenhoff) [08:02:20] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove static assets rewrite clause. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/798395 (https://phabricator.wikimedia.org/T302465) [08:02:46] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, looks good. Will merge." [puppet] - 10https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:02:58] (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:06:12] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "The values in the chart are for an example site, it doesn't really make sense to add them there." [deployment-charts] - 10https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [08:06:56] (03PS1) 10Jbond: spdx: add Rakefile, README and .conf files [puppet] - 10https://gerrit.wikimedia.org/r/798401 [08:07:11] (03CR) 10Jbond: [C: 03+2] spdx: add Rakefile, README and .conf files [puppet] - 10https://gerrit.wikimedia.org/r/798401 (owner: 10Jbond) [08:07:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:07:28] (03CR) 10CI reject: [V: 04-1] spdx: add Rakefile, README and .conf files [puppet] - 10https://gerrit.wikimedia.org/r/798401 (owner: 10Jbond) [08:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298555)', diff saved to https://phabricator.wikimedia.org/P28420 and previous config saved to /var/cache/conftool/dbconfig/20220524-080758-ladsgroup.json [08:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:04] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:08:04] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28421 and previous config saved to /var/cache/conftool/dbconfig/20220524-080804-ladsgroup.json [08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:51] (03PS2) 10Jbond: spdx: add Rakefile, README and .conf files [puppet] - 10https://gerrit.wikimedia.org/r/798401 [08:10:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [08:11:30] (03CR) 10Jbond: [C: 03+2] spdx: add Rakefile, README and .conf files [puppet] - 10https://gerrit.wikimedia.org/r/798401 (owner: 10Jbond) [08:12:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [08:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:46] (03Merged) 10jenkins-bot: mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [08:15:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:17:50] (03PS6) 10Giuseppe Lavagetto: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 (owner: 10Ahmon Dancy) [08:20:03] (03CR) 10Muehlenhoff: tmpreaper: Add SPDX header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:20:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [08:20:06] RECOVERY - k8s API server requests 
latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:05] (03CR) 10Muehlenhoff: [C: 03+2] toil: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:22:12] !log resume deletion of 'swift-tegola-container' on thanos-fe2001 - T307184 [08:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:17] T307184: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 [08:23:00] Amir1: marostegui Anything except: https://phabricator.wikimedia.org/T306963#7949625 needed from the Language team to go ahead? [08:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28422 and previous config saved to /var/cache/conftool/dbconfig/20220524-082309-ladsgroup.json [08:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 (owner: 10Ahmon Dancy) [08:27:28] kart_: grants are important, I guess only SELECT for now? [08:30:11] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) hi @Jgiannelos, I have resumed work on this and was wondering what's the theoretical limit of tiles per container? As... 
[08:30:26] (03PS1) 10Volans: Netbox: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/798425 [08:30:58] (03Merged) 10jenkins-bot: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 (owner: 10Ahmon Dancy) [08:31:19] (03PS2) 10Volans: Netbox: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013) [08:33:06] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [08:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:47] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:36] (03PS1) 10Jbond: spdx: add task to convert modules [puppet] - 10https://gerrit.wikimedia.org/r/798426 [08:35:22] (03CR) 10CI reject: [V: 04-1] spdx: add task to convert modules [puppet] - 10https://gerrit.wikimedia.org/r/798426 (owner: 10Jbond) [08:36:37] (03CR) 10Daniel Kinzler: [C: 03+1] "yes, please" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [08:38:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28423 and previous config saved to /var/cache/conftool/dbconfig/20220524-083814-ladsgroup.json [08:38:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:38:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:38:20] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:38:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28424 and previous config saved to /var/cache/conftool/dbconfig/20220524-083822-ladsgroup.json [08:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013) (owner: 10Volans) [08:40:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [08:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:03] Amir1: yes. SELECT only for now. 
[08:42:40] I will do it ASAP [08:42:43] (03CR) 10Muehlenhoff: [C: 03+2] debian: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793778 (owner: 10Muehlenhoff) [08:43:11] (03CR) 10Volans: [C: 03+2] Netbox: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/798425 (https://phabricator.wikimedia.org/T308013) (owner: 10Volans) [08:43:25] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:42] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:09] (03PS1) 10Filippo Giunchedi: thanos: fix alert 'source' url [puppet] - 10https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) [08:44:37] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:45:41] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35509/console" [puppet] - 10https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) (owner: 10Filippo Giunchedi) [08:47:59] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: allow deploying site-specific 
alerts [puppet] - 10https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:48:08] (03PS3) 10Filippo Giunchedi: alerts: allow deploying site-specific alerts [puppet] - 10https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847) [08:49:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:50:08] (03CR) 10Muehlenhoff: [C: 03+2] Add some additional SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/798393 (owner: 10Muehlenhoff) [08:50:33] (03CR) 10Volans: "Some typos and a suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/798426 (owner: 10Jbond) [08:50:37] godog: shall I merge along? [08:50:40] (03PS2) 10Filippo Giunchedi: alerts: take rule file site into consideration when deploying [puppet] - 10https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847) [08:50:53] moritzm: yes please! [08:51:06] sorry about that, totally forgot [08:51:13] ack, done [08:52:48] (03PS1) 10Slyngshede: Allow for Apache2 to not bind to port 80. 
[puppet] - 10https://gerrit.wikimedia.org/r/798446 [08:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28425 and previous config saved to /var/cache/conftool/dbconfig/20220524-085314-ladsgroup.json [08:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:20] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:53:52] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: take rule file site into consideration when deploying [puppet] - 10https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:55:02] kart_: one other thing, the db you have is sqlite, can you make a mysql dump instead? it'd make moving the data much easier [08:55:32] (03PS2) 10Jbond: rake spdx: add convert task for profiles [puppet] - 10https://gerrit.wikimedia.org/r/798426 [08:56:17] (03PS1) 10Ladsgroup: ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673) [08:57:14] (03PS1) 10Ladsgroup: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) [08:57:40] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35510/console" [puppet] - 10https://gerrit.wikimedia.org/r/798446 (owner: 10Slyngshede) [08:58:16] (03PS3) 10Jbond: rake spdx: add convert task for profiles [puppet] - 10https://gerrit.wikimedia.org/r/798426 [08:58:18] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/798426 (owner: 10Jbond) [08:58:42] (03PS3) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 
(https://phabricator.wikimedia.org/T305847) [08:58:44] (03PS1) 10Filippo Giunchedi: sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - 10https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847) [08:59:37] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798426 (owner: 10Jbond) [08:59:48] (03CR) 10Jbond: [C: 03+2] rake spdx: add convert task for profiles [puppet] - 10https://gerrit.wikimedia.org/r/798426 (owner: 10Jbond) [09:00:13] (03PS2) 10Filippo Giunchedi: sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - 10https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847) [09:00:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [09:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:27] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: limit mail alerts to prometheus/ops in codfw and eqiad [alerts] - 10https://gerrit.wikimedia.org/r/798448 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:07:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [09:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:56] (03CR) 10Btullis: [C: 03+2] Enable cassandra encryption (aqs cluster) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [09:08:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28427 and previous config saved to /var/cache/conftool/dbconfig/20220524-090819-ladsgroup.json [09:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:02] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:11:26] (03PS1) 10Slyngshede: Move Apache2 to alternative port [puppet] - 10https://gerrit.wikimedia.org/r/798450 [09:13:51] !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001 [09:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:03] (03CR) 10Slyngshede: [C: 03+2] Move Apache2 to alternative port [puppet] - 10https://gerrit.wikimedia.org/r/798450 (owner: 10Slyngshede) [09:16:27] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10fgiunchedi) 05Open→03Invalid I can't find any more errors for now, tentatively and optimistically resolving as invalid, will reopen if issues pop up again [09:20:23] (03CR) 10Vgutierrez: [C: 03+1] purged: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793401 (owner: 10Muehlenhoff) [09:22:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet [09:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28428 and previous config saved to /var/cache/conftool/dbconfig/20220524-092324-ladsgroup.json [09:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:54] (03PS1) 10Jcrespo: mariadb::misc: Fix motd that was marking misc hosts as core [puppet] - 10https://gerrit.wikimedia.org/r/798467 [09:25:02] (03PS3) 10Giuseppe Lavagetto: varnish: Expand static.php optimisation regarless of query string [puppet] - 10https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [09:25:07] (03CR) 10Giuseppe Lavagetto: [C: 
03+1] varnish: Expand static.php optimisation regarless of query string [puppet] - 10https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [09:28:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [09:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:48] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: remove static assets rewrite clause. [deployment-charts] - 10https://gerrit.wikimedia.org/r/798395 (https://phabricator.wikimedia.org/T302465) (owner: 10Giuseppe Lavagetto) [09:29:59] (03PS1) 10Slyngshede: Add listen port to move repo from 80 to 8080 [puppet] - 10https://gerrit.wikimedia.org/r/798478 [09:30:57] (03CR) 10Slyngshede: [C: 03+2] Add listen port to move repo from 80 to 8080 [puppet] - 10https://gerrit.wikimedia.org/r/798478 (owner: 10Slyngshede) [09:32:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet [09:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] !log root@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5001.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [09:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:12] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5001.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [09:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [09:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: Expand static.php optimisation regarless of query string [puppet] - 10https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) 
[09:38:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28430 and previous config saved to /var/cache/conftool/dbconfig/20220524-093830-ladsgroup.json [09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:36] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:41:55] (03PS1) 10Jbond: rake spdx: update file binary check [puppet] - 10https://gerrit.wikimedia.org/r/798504 [09:42:41] (03CR) 10CI reject: [V: 04-1] rake spdx: update file binary check [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [09:46:23] (03PS2) 10Jbond: rake spdx: update file binary check [puppet] - 10https://gerrit.wikimedia.org/r/798504 [09:49:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [09:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:22] !log installing openssl security updates [09:50:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28431 and previous config saved to /var/cache/conftool/dbconfig/20220524-095030-ladsgroup.json [09:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] T303603: Add actor and comment columns to cu_changes - 
https://phabricator.wikimedia.org/T303603 [09:51:56] !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001 [09:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:21] (03CR) 10Volans: [C: 03+1] "LGTM, couple of optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [09:52:53] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [09:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:56] (03CR) 10Marostegui: [C: 03+1] "thanks for catching this" [puppet] - 10https://gerrit.wikimedia.org/r/798467 (owner: 10Jcrespo) [09:59:11] (03PS4) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) [09:59:13] (03PS1) 10Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) [10:04:37] (03CR) 10Jbond: rake spdx: update file binary check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [10:05:05] (03PS1) 10Jbond: spdx: add role task [puppet] - 10https://gerrit.wikimedia.org/r/798537 [10:05:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28432 and previous config saved to /var/cache/conftool/dbconfig/20220524-100553-ladsgroup.json [10:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:00] T303603: Add actor and comment columns to cu_changes - 
https://phabricator.wikimedia.org/T303603 [10:06:52] (03PS3) 10Jbond: rake spdx: update file binary check [puppet] - 10https://gerrit.wikimedia.org/r/798504 [10:06:54] (03CR) 10Jbond: rake spdx: update file binary check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [10:07:26] !log installing imagemagick security updates [10:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:09] (03PS2) 10Jbond: spdx: add role task [puppet] - 10https://gerrit.wikimedia.org/r/798537 [10:08:38] hi all im going to temporarily disable puppet fleet wide to perform puppetmaster/db reboots [10:09:00] (03CR) 10CI reject: [V: 04-1] spdx: add role task [puppet] - 10https://gerrit.wikimedia.org/r/798537 (owner: 10Jbond) [10:09:54] (03CR) 10Filippo Giunchedi: "Not 100% sure about the PCC diff, I wasn't expecting all the new resources, is that expected ?" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [10:10:23] (03PS3) 10Jbond: spdx: add role task [puppet] - 10https://gerrit.wikimedia.org/r/798537 [10:10:45] jbond: ack [10:13:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [10:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:41] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb1002.eqiad.wmnet [10:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:26] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet [10:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:39] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:45] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [10:18:03] !log rebalance Ganeti cluster in eqsin T308211 [10:18:06] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:08] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 [10:18:09] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:18:11] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:55] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10Performance-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) @BBlack do you have thoughts on this? [10:19:21] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:18] (03CR) 10Jbond: [C: 03+2] nrpe::plugin: don't require a source with ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/798372 (owner: 10Majavah) [10:20:25] !log installing vim security updates [10:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:37] RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [10:20:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28434 and previous config saved to /var/cache/conftool/dbconfig/20220524-102058-ladsgroup.json [10:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:46] (03PS2) 10Majavah: nrpe: manage sudo rules via nrpe::check [puppet] - 
10https://gerrit.wikimedia.org/r/797422 [10:22:11] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:22:15] RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:48] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] aqs: allow Kubernetes nodes access to cassandra [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:25:54] (03PS3) 10Hnowlan: aqs: allow Kubernetes nodes access to cassandra [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) [10:25:58] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetdb2002.codfw.wmnet [10:25:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35511/console" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1002.eqiad.wmnet [10:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:37] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet [10:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:38] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2002.codfw.wmnet [10:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:03] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for 
host puppetmaster1002.eqiad.wmnet [10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:08] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2002.codfw.wmnet [10:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:25] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1004.eqiad.wmnet [10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:13] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1004.eqiad.wmnet [10:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:49] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1005.eqiad.wmnet [10:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2002.codfw.wmnet [10:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1002.eqiad.wmnet [10:33:09] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2003.codfw.wmnet [10:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:14] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1003.eqiad.wmnet [10:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:34] (03PS1) 10Slyngshede: Handle port deallocation for Apache. 
[puppet] - 10https://gerrit.wikimedia.org/r/798570 [10:33:38] (03PS1) 10Filippo Giunchedi: test_alerts: report filename on assertion failure [alerts] - 10https://gerrit.wikimedia.org/r/798571 [10:34:07] PROBLEM - Check systemd state on ganeti2022 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:45] (03CR) 10Majavah: [V: 03+1] nrpe: manage sudo rules via nrpe::check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [10:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28435 and previous config saved to /var/cache/conftool/dbconfig/20220524-103603-ladsgroup.json [10:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1003.eqiad.wmnet [10:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1005.eqiad.wmnet [10:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:20] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2003.codfw.wmnet [10:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:44] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster2004.codfw.wmnet [10:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2005.codfw.wmnet [10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] 
- 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [10:40:57] (03CR) 10Slyngshede: [C: 03+2] Handle port deallocation for Apache. [puppet] - 10https://gerrit.wikimedia.org/r/798570 (owner: 10Slyngshede) [10:41:02] (03CR) 10Jbond: [C: 03+2] rake spdx: update file binary check [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [10:41:06] (03CR) 10Jbond: [C: 03+2] spdx: add role task [puppet] - 10https://gerrit.wikimedia.org/r/798537 (owner: 10Jbond) [10:42:13] slyngs: happy for me to merge your Apache port deallocation CR [10:42:23] Yes [10:43:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2004.codfw.wmnet [10:43:08] (03CR) 10Volans: rake spdx: update file binary check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798504 (owner: 10Jbond) [10:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:37] (03PS1) 10Jbond: spdx: drop puts/debugging [puppet] - 10https://gerrit.wikimedia.org/r/798585 [10:43:47] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster2005.codfw.wmnet [10:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] spdx: drop puts/debugging [puppet] - 10https://gerrit.wikimedia.org/r/798585 (owner: 10Jbond) [10:44:47] RECOVERY - Check systemd state on ganeti2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [10:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:43] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[10:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303603)', diff saved to https://phabricator.wikimedia.org/P28436 and previous config saved to /var/cache/conftool/dbconfig/20220524-105108-ladsgroup.json [10:51:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:51:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:14] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28437 and previous config saved to /var/cache/conftool/dbconfig/20220524-105116-ladsgroup.json [10:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:55] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:54:31] jbond: just to confirm: do you want someone else to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/795380? 
[10:55:51] (03PS1) 10Jbond: C:httpd: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/798604 [11:00:32] !log restart db1150 T308315 [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:39] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:03:29] (03PS1) 10Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/798615 [11:04:07] (03CR) 10Muehlenhoff: C:httpd: add documentation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798604 (owner: 10Jbond) [11:04:22] (03CR) 10CI reject: [V: 04-1] C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/798615 (owner: 10Jbond) [11:07:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28438 and previous config saved to /var/cache/conftool/dbconfig/20220524-110728-ladsgroup.json [11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:34] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:12:34] (03PS1) 10Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - 10https://gerrit.wikimedia.org/r/798617 [11:14:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [11:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [11:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:49] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [11:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:58] (03CR) 10Jbond: [C: 03+2] 
C:httpd: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/798604 (owner: 10Jbond) [11:21:03] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:21:27] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/798604 (owner: 10Jbond) [11:22:29] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:22:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28439 and previous config saved to /var/cache/conftool/dbconfig/20220524-112233-ladsgroup.json [11:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:56] (03PS2) 10Filippo Giunchedi: test_alerts: report filename on assertion failure [alerts] - 10https://gerrit.wikimedia.org/r/798571 [11:23:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:10] (03PS2) 10Jbond: C:httpd: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/798604 [11:27:12] (03PS2) 10Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/798615 [11:28:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35513/console" [puppet] - 10https://gerrit.wikimedia.org/r/798615 (owner: 10Jbond) [11:30:12] disabling puppet again i missed puppetmaster1001 [11:30:18] !log disable puppet fleet wide [11:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:00] (03CR) 10Filippo Giunchedi: [C: 03+2] test_alerts: report filename on assertion failure [alerts] - 10https://gerrit.wikimedia.org/r/798571 (owner: 10Filippo Giunchedi) [11:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28440 and previous config saved to /var/cache/conftool/dbconfig/20220524-113112-ladsgroup.json [11:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:18] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [11:33:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [11:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:38] kubetcd2004 will be going down temporarily [11:34:10] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet [11:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:41] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28441 and previous config saved to /var/cache/conftool/dbconfig/20220524-113738-ladsgroup.json [11:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798604 (owner: 10Jbond) [11:39:01] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:27] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster1001.eqiad.wmnet [11:40:29] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [11:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:43] (03CR) 10Jbond: [C: 03+2] C:httpd: add documentation [puppet] - 10https://gerrit.wikimedia.org/r/798604 (owner: 
Jbond) [11:40:51] (CR) Jbond: [V: +1 C: +2] C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/798615 (owner: Jbond) [11:44:33] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:45:01] PROBLEM - PHP7 jobrunner on mw2382 is CRITICAL: connect to address 10.192.0.45 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:45:09] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: connect to address 10.64.48.84 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:45:17] PROBLEM - Apache HTTP on mw2306 is CRITICAL: connect to address 10.192.0.176 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:45:18] (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:23] PROBLEM - Apache HTTP on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:45:25] _joe_: [11:45:29] #page [11:45:33] ok i think i broke things [11:45:33] PROBLEM - Apache HTTP on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:45:41] PROBLEM - Apache HTTP on mw1361 is CRITICAL: connect to address 10.64.48.203 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:45:52] <_joe_> uh what's going on with apache [11:45:53] !log disable puppet on mw servers [11:45:55] jbond: ack, how can we help ? 
[11:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:01] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: connect to address 10.64.48.79 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:46:07] _joe_: i rolled out a patch which caused an apache reload [11:46:09] PROBLEM - Apache HTTP on mw2377 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:09] PROBLEM - Apache HTTP on mw2389 is CRITICAL: connect to address 10.192.0.52 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:09] PROBLEM - Apache HTTP on mw2403 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:11] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: connect to address 10.64.48.79 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:46:16] i have disabled puppet everywhere and am rolling back the patch now [11:46:17] PROBLEM - Check systemd state on mw2306 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28442 and previous config saved to /var/cache/conftool/dbconfig/20220524-114617-ladsgroup.json [11:46:18] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:21] PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following 
units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:21] PROBLEM - PHP7 jobrunner on mw2351 is CRITICAL: connect to address 10.192.32.201 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:46:21] PROBLEM - Apache HTTP on mw2273 is CRITICAL: connect to address 10.192.48.95 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:23] PROBLEM - Apache HTTP on mw2254 is CRITICAL: connect to address 10.192.16.53 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:31] (PS1) Jbond: Revert "C:httpd: allow users to pass the listen_ports to use" [puppet] - https://gerrit.wikimedia.org/r/797222 [11:46:35] PROBLEM - PHP7 rendering on mw2351 is CRITICAL: connect to address 10.192.32.201 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:46:37] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:38] (CR) Jbond: [V: +2 C: +2] Revert "C:httpd: allow users to pass the listen_ports to use" [puppet] - https://gerrit.wikimedia.org/r/797222 (owner: Jbond) [11:46:43] PROBLEM - PHP7 rendering on mw2382 is CRITICAL: connect to address 10.192.0.45 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:46:46] <_joe_> oh sigh, I think we're down [11:46:57] PROBLEM - Apache HTTP on mw1330 is CRITICAL: connect to address 10.64.32.32 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:46:59] PROBLEM - Check systemd state on mw2254 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:59] eswiki still up for me [11:47:04] <_joe_> jbond: start from eqiad with the forced puppet run [11:47:05] enwiki up here [11:47:06] emn still up for me [11:47:09] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: connect to address 10.64.48.84 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:47:11] PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:11] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:11] PROBLEM - Check systemd state on mw1366 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:13] PROBLEM - Apache HTTP on mw2335 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:17] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:17] ack just reverting now [11:47:20] <_joe_> yeah we won't be for long [11:47:21] PROBLEM - Apache HTTP on mw2339 is CRITICAL: connect to address 10.192.32.117 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:29] PROBLEM - Apache HTTP on wtp1036 is CRITICAL: connect to address 10.64.16.91 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:29] PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:32] <_joe_> revert and target eqiad first [11:47:35] PROBLEM - Apache HTTP on mw2297 is CRITICAL: connect to address 10.192.0.167 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:37] PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:39] PROBLEM - Apache HTTP on mw1366 is CRITICAL: connect to address 10.64.48.208 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:44] ack targeting eqiad now [11:47:45] PROBLEM - Check systemd state on mw2360 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:47] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:47:49] PROBLEM - HTTPS-peopleweb on people1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.003 second response time https://wikitech.wikimedia.org/wiki/People.wikimedia.org [11:47:49] PROBLEM - Apache HTTP on mw2360 is CRITICAL: connect to address 10.192.32.210 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:47:51] PROBLEM - Check systemd state on mw2407 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:59] PROBLEM - Apache HTTP on mw2307 is CRITICAL: connect to address 10.192.0.177 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:48:05] PROBLEM - Check systemd state on mw2297 is CRITICAL: CRITICAL - degraded: The following units 
failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:07] PROBLEM - Check systemd state on people1003 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:07] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:48:10] <_joe_> jbond: please first verify the fix works [11:48:15] PROBLEM - Apache HTTP on mw1333 is CRITICAL: connect to address 10.64.32.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:48:17] PROBLEM - Check systemd state on mw2307 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:17] PROBLEM - Apache HTTP on mw2268 is CRITICAL: connect to address 10.192.16.69 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:48:17] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01073 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:48:19] PROBLEM - PHP7 jobrunner on mw2411 is CRITICAL: connect to address 10.192.0.122 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [11:48:21] PROBLEM - Check systemd state on mw2382 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:21] PROBLEM - Check systemd state on mw2335 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:23] PROBLEM - PHP7 rendering on mw2273 is CRITICAL: connect to address 10.192.48.95 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:48:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:48:23] PROBLEM - Check systemd state on mw2411 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:31] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:35] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:48:35] PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:37] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:48:38] _joe_: ack [11:48:40] PROBLEM - LVS kibana7 eqiad port 443/tcp - Kibana v7 env - HTTPS IPv4 #page on kibana7.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:48:41] PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:43] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: connect to address 10.64.16.153 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:48:45] PROBLEM - piwik.wikimedia.org requires authentication on matomo1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:48:49] PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:49] PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:59] PROBLEM - Check systemd state on wtp1036 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:03] PROBLEM - PHP7 rendering on mw2403 is CRITICAL: connect to address 10.192.0.67 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:03] PROBLEM - PHP7 rendering on mw2377 is CRITICAL: connect to address 10.192.0.40 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:03] PROBLEM - Check systemd state on mw1445 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:11] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:15] PROBLEM - Apache HTTP on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:49:17] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: connect to address 10.64.32.41 
and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:21] PROBLEM - Check systemd state on mw1421 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:21] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: connect to address 10.64.32.32 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:21] PROBLEM - Apache HTTP on mw2272 is CRITICAL: connect to address 10.192.48.94 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:49:29] PROBLEM - PHP7 rendering on mw2254 is CRITICAL: connect to address 10.192.16.53 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:29] PROBLEM - PHP7 rendering on mw2311 is CRITICAL: connect to address 10.192.16.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:49:41] PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:48] <_joe_> jbond: it's still not working AFAICT [11:49:56] <_joe_> so please don't run puppet everywhere [11:49:59] PROBLEM - Apache HTTP on wtp1033 is CRITICAL: connect to address 10.64.16.88 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:50:01] PROBLEM - Apache HTTP on parse2002 is CRITICAL: connect to address 10.192.0.183 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:50:09] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:11] PROBLEM - PHP7 rendering on 
mw1361 is CRITICAL: connect to address 10.64.48.203 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:50:15] _joe_: ack, not running puppet anywhere, troubleshooting on mw1395 [11:50:19] PROBLEM - Check systemd state on mw2268 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:21] PROBLEM - PHP7 rendering on mw2339 is CRITICAL: connect to address 10.192.32.117 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:50:21] PROBLEM - PHP7 rendering on mw2407 is CRITICAL: connect to address 10.192.0.75 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:50:23] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:50:25] PROBLEM - Check systemd state on doc2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:25] PROBLEM - Check systemd state on mw2403 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:39] PROBLEM - Check systemd state on wtp1033 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:45] PROBLEM - Check systemd state on mw2408 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:46] <_joe_> jbond: it seems it tries to listen on port 443 [11:50:59] PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:01] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: connect to address 10.64.32.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:51:03] PROBLEM - PHP7 rendering on mw1421 is CRITICAL: connect to address 10.64.0.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:51:17] jbond: your patch makes apache2 listen on 443 regardless whether mod_ssl is enabled [11:51:19] yes and envoy is on there one sec let me remove that from the config via cumin [11:51:25] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:51:28] the default file wraps it in IfModule [11:51:31] PROBLEM - PHP7 rendering on mw2360 is CRITICAL: connect to address 10.192.32.210 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:51:35] PROBLEM - PHP7 rendering on mw2408 is CRITICAL: connect to address 10.192.0.76 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:51:43] PROBLEM - Apache HTTP on wtp1048 is CRITICAL: connect to address 10.64.48.166 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:51:43] PROBLEM - Check systemd state on parse2012 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:47] PROBLEM - Apache HTTP on mw1395 is CRITICAL: connect to address 10.64.16.153 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:51:49] PROBLEM - Apache HTTP on mw2292 is CRITICAL: connect to address 10.192.0.162 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:51:53] PROBLEM - Apache HTTP on mw2408 is CRITICAL: connect to address 10.192.0.76 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:51:53] PROBLEM - PHP7 rendering on mw2411 is CRITICAL: connect to address 10.192.0.122 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:51:57] PROBLEM - Apache HTTP on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:51:59] <_joe_> yeah thanks taavi I was about to point that out [11:52:07] PROBLEM - Apache HTTP on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:52:23] PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:29] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: connect to address 10.64.48.208 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:33] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: connect to address 10.192.0.162 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:35] PROBLEM - PHP7 rendering on mw2297 is CRITICAL: connect to address 10.192.0.167 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:35] PROBLEM - PHP7 rendering on mw2307 is CRITICAL: connect to address 10.192.0.177 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:37] PROBLEM - PHP7 rendering on mw2268 is CRITICAL: connect to address 10.192.16.69 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:39] PROBLEM - PHP7 rendering on parse2001 is CRITICAL: connect to address 10.192.0.182 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:43] RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28443 and previous config saved to /var/cache/conftool/dbconfig/20220524-115243-ladsgroup.json [11:52:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:52:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:52:47] RECOVERY - Apache HTTP on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:50] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:52:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28444 and previous config saved to 
/var/cache/conftool/dbconfig/20220524-115251-ladsgroup.json [11:52:53] PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:55] PROBLEM - PHP7 rendering on parse2002 is CRITICAL: connect to address 10.192.0.183 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:01] PROBLEM - Check systemd state on mw2351 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:03] PROBLEM - Check systemd state on mw2377 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:15] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:15] PROBLEM - Check systemd state on wtp1048 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:37] PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: connect to address 10.64.16.88 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:49] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:57] PROBLEM - PHP7 rendering on wtp1048 is 
CRITICAL: connect to address 10.64.48.166 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:54:01] PROBLEM - PHP7 rendering on parse2012 is CRITICAL: connect to address 10.192.32.196 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:54:05] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:54:15] RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:29] PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:35] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.03066 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:54:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:54:51] PROBLEM - PHP7 rendering on parse2009 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:01] PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:05] PROBLEM - PHP7 rendering on mw2306 is CRITICAL: connect to address 10.192.0.176 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:09] PROBLEM - PHP7 rendering on mw2335 is CRITICAL: connect to address 10.192.32.112 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:19] PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: connect to address 10.64.16.91 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:23] PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:25] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:55:25] PROBLEM - Check systemd state on mw1330 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:39] PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:56:18] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:41] PROBLEM - Apache HTTP on mw2311 is CRITICAL: connect to address 10.192.16.158 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:56:45] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers [11:56:45] PROBLEM - Apache HTTP on mw2407 is CRITICAL: connect to address 10.192.0.75 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:56:53] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: connect to address 10.64.48.161 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:57:07] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:57:09] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:57:23] RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:57:23] RECOVERY - Apache HTTP on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:57:25] RECOVERY - PHP7 rendering on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:57:27] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:29] RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:33] RECOVERY - Check systemd state on mw1333 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:37] RECOVERY - Check systemd state on mw2408 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:37] RECOVERY - Check systemd state on mw2377 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:41] RECOVERY - PHP7 rendering on mw2351 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:57:41] RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:45] RECOVERY - Check systemd state on mw1330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:51] RECOVERY - Check systemd state on mw2339 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:53] RECOVERY - PHP7 rendering on mw2382 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:57:53] RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:57:55] RECOVERY - PHP7 rendering on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:09] RECOVERY - PHP7 rendering on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:09] PROBLEM - Check systemd state on parse2009 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:09] RECOVERY - PHP7 
rendering on mw2377 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:11] RECOVERY - Check systemd state on mw1445 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:11] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:15] RECOVERY - Check systemd state on mw2254 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:15] RECOVERY - PHP7 jobrunner on mw2382 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:58:15] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:58:21] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:21] RECOVERY - Apache HTTP on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:23] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:58:23] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:25] RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.131 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:25] RECOVERY - Check systemd state on mw1366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:27] RECOVERY - Apache HTTP on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:27] RECOVERY - Check systemd state on mw1421 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:27] RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:29] RECOVERY - Apache HTTP on mw2272 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:29] RECOVERY - PHP7 rendering on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:29] RECOVERY - Apache HTTP on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:33] RECOVERY - Apache HTTP on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:35] RECOVERY - PHP7 rendering on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:37] RECOVERY - PHP7 rendering on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:43] RECOVERY - Apache HTTP on mw2292 is 
OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:45] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:45] RECOVERY - PHP7 rendering on mw2411 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:58:45] RECOVERY - Apache HTTP on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:47] RECOVERY - Apache HTTP on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:49] RECOVERY - Check systemd state on mw2272 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:51] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:51] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:58:57] RECOVERY - Check systemd state on mw2360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:59] RECOVERY - Apache HTTP on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:01] RECOVERY - Apache HTTP on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:03] RECOVERY - Check systemd state on mw2407 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:03] RECOVERY - Apache HTTP on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:11] RECOVERY - Apache HTTP on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:15] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:19] RECOVERY - Check systemd state on mw2297 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:19] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:23] RECOVERY - Apache HTTP on mw2377 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:23] RECOVERY - Apache HTTP on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:23] RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:23] RECOVERY - Apache HTTP on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:25] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:27] RECOVERY - PHP7 rendering 
on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:27] RECOVERY - PHP7 rendering on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:27] RECOVERY - Check systemd state on mw2268 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:29] RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:29] RECOVERY - Apache HTTP on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:59:29] RECOVERY - Check systemd state on mw2307 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:29] RECOVERY - PHP7 rendering on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:29] RECOVERY - PHP7 rendering on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:59:31] RECOVERY - PHP7 jobrunner on mw2411 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:59:33] RECOVERY - Check systemd state on mw2403 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:35] RECOVERY - Check systemd state on mw2382 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:35] RECOVERY - Check systemd state on mw2306 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:35] RECOVERY - Check systemd state on mw2335 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:39] RECOVERY - Check systemd state on mw2411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:39] RECOVERY - PHP7 jobrunner on mw2351 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:59:39] RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:18] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28445 and previous config saved to /var/cache/conftool/dbconfig/20220524-120122-ladsgroup.json [12:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:54] (PS1) Majavah: httpd: use default ports.conf if nothing else was configured [puppet] - https://gerrit.wikimedia.org/r/798631 [12:01:58] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:02:13] (KubernetesRsyslogDown) 
firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:02:52] PROBLEM - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:03:12] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/798631 (owner: Majavah) [12:03:13] (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:03:31] (PS1) Giuseppe Lavagetto: httpd: reintroduce the default debian ports.conf where no changes were expected. [puppet] - https://gerrit.wikimedia.org/r/798633 [12:03:41] RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:47] mmhhh looking at the thanos-query thing, I suspect that's the apache issue jbond [12:04:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [12:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:16] godog: definitely possible, patch being reviewed now [12:04:31] so far I see no end users reporting problems anywhere [12:04:39] (CR) CI reject: [V: -1] httpd: use default ports.conf if nothing else was configured [puppet] - https://gerrit.wikimedia.org/r/798631 (owner: Majavah) [12:04:45] (CR) CI reject: [V: -1] httpd: reintroduce the default debian ports.conf where no changes were expected. 
[puppet] - https://gerrit.wikimedia.org/r/798633 (owner: Giuseppe Lavagetto) [12:05:01] I am monitoring phabricator and other channels [12:05:25] (CR) Jbond: [V: +2 C: +2] httpd: use default ports.conf if nothing else was configured [puppet] - https://gerrit.wikimedia.org/r/798631 (owner: Majavah) [12:05:29] At no point did enwiki go down as far as I saw [12:05:44] jbond: ack [12:05:57] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:06:02] * Emperor here (sorry, was eating lunch) [12:06:05] !log disable puppet on c:httpd [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:18] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:06:35] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:06:55] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.129 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:06:58] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:07:28] RECOVERY - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:07:43] RECOVERY - Check systemd state on 
thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:13] RECOVERY - HTTPS-peopleweb on people1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 1.008 second response time https://wikitech.wikimedia.org/wiki/People.wikimedia.org [12:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28446 and previous config saved to /var/cache/conftool/dbconfig/20220524-120816-ladsgroup.json [12:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:22] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:08:25] RECOVERY - Check systemd state on people1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:09] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:21] RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.010 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:09:33] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:09:37] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.134 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:09:41] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:15] (PS8) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [12:10:57] RECOVERY - Check systemd state on wtp1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:57] RECOVERY - PHP7 rendering on parse2002 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:11:03] RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:11:15] RECOVERY - Check systemd state on wtp1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:17] RECOVERY - Check systemd state on parse2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:25] RECOVERY - Check systemd state on wtp1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:33] RECOVERY - Check systemd state on parse2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:33] RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:11:45] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:51] RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 
302 Found - 560 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:11:55] RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:11:55] RECOVERY - PHP7 rendering on parse2012 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:11:55] RECOVERY - Apache HTTP on parse2001 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:11:57] RECOVERY - Check systemd state on parse2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:59] RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:03] RECOVERY - Check systemd state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:07] RECOVERY - Apache HTTP on parse2009 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:17] RECOVERY - Apache HTTP on parse2012 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:23] RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:25] RECOVERY - Apache HTTP on parse2002 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:29] RECOVERY - Apache HTTP on wtp1043 is OK: 
HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:12:33] PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:41] RECOVERY - PHP7 rendering on parse2009 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:12:45] RECOVERY - PHP7 rendering on parse2001 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:12:51] RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:14:08] (PS9) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [12:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28447 and previous config saved to /var/cache/conftool/dbconfig/20220524-121627-ladsgroup.json [12:16:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:16:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:16:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:34] T298560: Fix mismatching field 
type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:16:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:39] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:41] RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28448 and previous config saved to /var/cache/conftool/dbconfig/20220524-121641-ladsgroup.json [12:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:53] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 564 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:57] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:57] RECOVERY - Check systemd state on doc2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:11] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:17:21] RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:23] RECOVERY - LVS kibana7 eqiad port 443/tcp - Kibana v7 env - HTTPS IPv4 #page on kibana7.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10033 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:17:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [12:17:27] RECOVERY - piwik.wikimedia.org requires authentication on matomo1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 542 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:31] RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:53] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:33] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 550 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:18:36] (PS10) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [12:19:36] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35516/console" [puppet] - 
https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey) [12:20:18] (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:18] (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:22] interestingly for kibana it was a partial failure as seen by prometheus, due to lvs hashing, 1005 did see the failure but 1006 did not [12:22:34] (PS1) Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/797223 [12:22:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [12:22:54] (PS2) Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617 [12:22:56] (CR) CI reject: [V: -1] C:httpd: allow users to pass the listen_ports to use [puppet] - https://gerrit.wikimedia.org/r/797223 (owner: Jbond) [12:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:02] (CR) Elukey: [V: +1] "After an email exchange with Eric we decided to move the config to a multi-instance config with a single cassandra instance for each node," [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey) [12:23:20] (CR) CI reject: [V: -1] P:aptrepo::private: update to use httpd listen_ports [puppet] - https://gerrit.wikimedia.org/r/798617 (owner: Jbond) [12:23:22] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28449 and previous config saved to /var/cache/conftool/dbconfig/20220524-122321-ladsgroup.json [12:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:33] (Abandoned) Slyngshede: Allow for Apache2 to not bind to port 80. [puppet] - https://gerrit.wikimedia.org/r/798446 (owner: Slyngshede) [12:27:17] SRE, Beta-Cluster-Infrastructure, Traffic, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) Happening again. Nothing in the apache2.log on mediawiki12 since 11:55 (UTC?) [12:27:38] SRE, Beta-Cluster-Infrastructure, Traffic, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) >>! In T302699#7952842, @dom_walden wrote: > Happening again. Nothing in the apache2.log on mediawiki... 
[12:30:32] !log installing openldap security updates [12:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [12:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:36] (03PS2) 10Jbond: C:httpd: allow users to pass the listen_ports to use [puppet] - 10https://gerrit.wikimedia.org/r/797223 [12:32:04] (03PS3) 10Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - 10https://gerrit.wikimedia.org/r/798617 [12:33:18] (03PS2) 10Muehlenhoff: Only add component/memcached16 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) [12:33:59] PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:45] RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [12:36:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [12:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P28450 and previous config saved to /var/cache/conftool/dbconfig/20220524-123826-ladsgroup.json [12:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] (03CR) 10Elukey: [V: 03+1 C: 03+2] Set fixed uid/gid for kafka by default [puppet] - 10https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: 10Elukey) [12:41:58] 
(KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:46:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [12:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/793744 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [12:52:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [12:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T303603)', diff saved to https://phabricator.wikimedia.org/P28452 and previous config saved to /var/cache/conftool/dbconfig/20220524-125331-ladsgroup.json [12:53:36] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite hi! There is no rush at the moment of course, but I am wondering what remaining clients needed to be migrated before being able to switch the broker's T... [12:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:38] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1300). [13:00:05] MdsShakil and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:39] hello [13:00:51] Good afternoon koi (•‿•) [13:01:43] (03PS4) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) [13:03:23] hey, I can deploy if no-one else is around [13:04:59] jouncebot: nowandnext [13:04:59] For the next 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1300) [13:04:59] In 2 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1600) [13:05:17] (03PS12) 10Majavah: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: 10MdsShakil) [13:05:23] MdsShakil: are you familiar with the backport process? [13:05:26] hello Amir1 [13:05:37] hello :D [13:06:00] I'll wait if you want to do it, once done I have some stuff [13:06:02] Yes, Once there was an opportunity [13:06:21] (03CR) 10Majavah: [C: 03+2] Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: 10MdsShakil) [13:06:38] great! 
I'll ping you once your patch is testable [13:07:09] (03Merged) 10jenkins-bot: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: 10MdsShakil) [13:07:18] (03CR) 10Vgutierrez: [WIP] esitest service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [13:08:15] MdsShakil: please test your change on mwdebug1001.eqiad.wmnet [13:09:45] Looking goods [13:09:52] ok, deploying [13:10:58] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793790|Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki (T308945)]] (duration: 00m 53s) [13:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] T308945: Remove patrol rights from autoconfirmed users and create a separate user group on bnwiki - https://phabricator.wikimedia.org/T308945 [13:11:22] (03PS1) 10Ladsgroup: mariadb: Add cxserverdb grant [puppet] - 10https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) [13:13:07] koi: with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/792752, is it expected that the logos downloaded from commons via `python3 logos/manage.py update zhwikisource` don't match what's already in the repository? 
[13:13:37] Thanks taavi [13:13:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:09] yes for the size, as its width is not 135px [13:14:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:07] (03CR) 10CDanis: "one nit one question" [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:15:39] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:05] (03CR) 10CDanis: [C: 03+1] fastnetmon: remove alert, ported to Prometheus / Alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:17:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:25] not sure if I understand - are you saying that the commons file can't actually be used to generate the logo files? 
I thought that was the point of declaring commons: in logos.yaml [13:19:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/791567 (owner: 10Jbond) [13:19:11] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:34] I mean the file currently inside this repository was indeed generated from SVG file on commons - and recently I made some amendment of the file on commons (for its width-height ratio) [13:19:53] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:58] and the file inside another patch submitted is generated from the new commons file [13:20:00] (03PS1) 10Jbond: netbox: add discovery name [puppet] - 10https://gerrit.wikimedia.org/r/798663 [13:20:07] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10akosiaris) [13:20:17] ahh, now I understand. thanks! 
[13:20:26] (03CR) 10Herron: [C: 03+1] Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:20:39] (03PS4) 10Majavah: zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:20:41] (03CR) 10Jbond: [C: 03+2] netbox: add discovery name [puppet] - 10https://gerrit.wikimedia.org/r/798663 (owner: 10Jbond) [13:20:52] (03CR) 10Majavah: [C: 03+2] zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:20:53] PROBLEM - Ganeti memory on ganeti2030 is CRITICAL: CRIT Memory 97% used. Largest process: qemu-system-x86 (20037) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [13:20:54] (03CR) 10Marostegui: [C: 03+1] mariadb: Add cxserverdb grant [puppet] - 10https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) (owner: 10Ladsgroup) [13:21:24] (03PS2) 10Ladsgroup: mariadb: Add cxserverdb grant [puppet] - 10https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) [13:21:26] (03PS1) 10Alexandros Kosiaris: Add aikochou and kevinbazira to deployment [puppet] - 10https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) [13:21:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Add cxserverdb grant [puppet] - 10https://gerrit.wikimedia.org/r/798661 (https://phabricator.wikimedia.org/T306963) (owner: 10Ladsgroup) [13:21:43] (03Merged) 10jenkins-bot: zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:22:02] (03CR) 10Herron: [C: 03+1] thanos: fix alert 'source' url [puppet] - 10https://gerrit.wikimedia.org/r/798438 
(https://phabricator.wikimedia.org/T309081) (owner: 10Filippo Giunchedi) [13:22:36] I'm guessing the first one can't be tested since it's comments only? [13:22:38] (03CR) 10Elukey: "Hi Alex! Already filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/791036, lemme know if it is ok or if I have to drop it :)" [puppet] - 10https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) (owner: 10Alexandros Kosiaris) [13:23:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:53] yes, but need a sync [13:24:03] ack, will do [13:24:14] (03PS2) 10Majavah: zhwikisource: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:24:21] (03CR) 10Majavah: [C: 03+2] zhwikisource: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:25:01] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:792752|zhwikisource: Declare commons files for logo (T308620)]] (duration: 00m 53s) [13:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:08] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [13:25:09] (03Merged) 10jenkins-bot: zhwikisource: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [13:25:54] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792752|zhwikisource: Declare commons files for logo (T308620)]] (duration: 00m 52s) [13:25:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] (03PS2) 10Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) [13:26:05] (03PS5) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) [13:26:17] (03PS1) 10Alexandros Kosiaris: Add sgimeno to deployment [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) [13:26:20] (03CR) 10Filippo Giunchedi: "Thank you for the review" [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:26:25] koi: the second one can now be tested on mwdebug1001 [13:26:32] looking [13:26:47] and LGTM [13:27:06] ok, syncing [13:27:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] (03PS2) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) [13:27:58] !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:793127|zhwikisource: Optimize logo per commons files (T308620)]] (duration: 00m 55s) [13:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:28:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:16] (03CR) 10Filippo Giunchedi: fastnetmon: remove alert, ported to Prometheus / Alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793731 
(https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:28:18] done [13:28:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:24] ty! [13:28:25] anyone have anything else to deploy? [13:28:28] Amir1: ^ [13:28:37] mine takes time to merge [13:28:56] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/797220 [13:29:04] I +2 it and I can self-serve [13:29:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:29:08] (03CR) 10Ladsgroup: [C: 03+2] ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [13:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] (03CR) 10Majavah: "There's also a group called 'restricted' which may be more suitable here" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [13:29:25] ack [13:30:17] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] thanos: fix alert 'source' url [puppet] - 10https://gerrit.wikimedia.org/r/798438 (https://phabricator.wikimedia.org/T309081) (owner: 10Filippo Giunchedi) [13:34:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:59] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01022 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:35:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:20] i think thats a lie [13:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] (03CR) 10CDanis: [C: 03+1] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:36:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:38] (03CR) 10CDanis: [C: 03+1] fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:36:58] jbond: icinga is keeping you on your toes [13:37:17] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002044 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:37:30] LD [13:37:36] thats better icinga [13:37:37] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004088 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:39:33] unrelated, ganeti2030 has started swapping due to memory pressure [13:39:44] a rebalance may be needed [13:39:55] (this means my alert is working as intended) [13:40:11] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:40:17] 
(03PS6) 10Filippo Giunchedi: sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) [13:40:54] if there is ongoing ganeti reboots it will fix itself, otherwise I will have a look after lunch [13:42:33] this will balance itself out along with the reboots [13:42:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [13:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:51] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: testing non-RAIDing SSDs [13:43:26] (03CR) 10Filippo Giunchedi: [V: 03+2] sre: add fastnetmon alerting page [alerts] - 10https://gerrit.wikimedia.org/r/793723 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:43:37] (03CR) 10Filippo Giunchedi: [C: 03+2] fastnetmon: remove alert, ported to Prometheus / Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/793731 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:44:17] thanks moritzm! 
Still, I think the alert is useful to surface unnotice issues :-D [13:45:00] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: retry rails console, don't keep gitlab-secrets.json [puppet] - 10https://gerrit.wikimedia.org/r/797301 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:45:08] (03Merged) 10jenkins-bot: ApiQueryBacklinksprop: Completely remove index hints [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797220 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [13:47:20] (03CR) 10Filippo Giunchedi: [C: 03+2] Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:47:25] (03PS3) 10Filippo Giunchedi: Enforce hashtag-page in summary [alerts] - 10https://gerrit.wikimedia.org/r/798526 (https://phabricator.wikimedia.org/T305847) [13:48:57] (03CR) 10Filippo Giunchedi: "LGTM overall, will let others vote on this tho" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [13:49:54] (03PS1) 10Jbond: O:gerrit: move code around [puppet] - 10https://gerrit.wikimedia.org/r/798677 [13:50:29] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/api/ApiQueryBacklinksprop.php: Backport: [[gerrit:797220|ApiQueryBacklinksprop: Completely remove index hints (T306673)]] (duration: 00m 55s) [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:35] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [13:51:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:52:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:52:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:38] (03PS2) 10Ladsgroup: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) [13:53:43] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [13:54:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:32] (03Merged) 10jenkins-bot: Revert "Revert read new on frwiki for templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797221 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [13:54:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] mhhh checking, I think it might be timeouts [13:55:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:27] <_joe_> godog: what do you mean? 
[13:55:37] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797221|Revert "Revert read new on frwiki for templatelinks migration"]] (duration: 00m 52s) [13:55:37] * Emperor twitches [13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:57] _joe_: the probe hitting timeouts, though I thought I tweaked it [13:56:32] <_joe_> frankly I'm not sure I see issues with thumbor [13:56:43] <_joe_> Also not ure if it means eqiad or codfw [13:57:08] (03PS2) 10Jbond: O:gerrit: Pass rendered ports.conf config to httpd file [puppet] - 10https://gerrit.wikimedia.org/r/798677 [13:57:53] <_joe_> heh the 75th percentile is a bit elevated [13:58:28] that'd be eqiad (from the dashboard and the page) I'm also checking the 20s timeout we've set is actually being honored [13:58:45] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:58:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35518/console" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [13:59:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:34] <_joe_> it is a real problem though [13:59:38] <_joe_> thumbor is overloaded [14:00:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:01:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:32] there is a bit of a spike in originals uploads (but since about 10:00 when it peaked) [14:02:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:35] dont see a high load of thu/me dont see anything in sampled-1000 that look strange for thumb [14:03:40] cf https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=eqiad&var-prometheus=eqiad+prometheus%2Fops&from=1653374687094&to=1653400790979&viewPanel=31 [14:03:43] also definitely not the 20s timeout we specified in the blackbox-exporter config [14:03:46] ts=2022-05-24T14:02:25.359Z caller=main.go:169 module=http_thumbor_ip4 target=http://[10.2.2.24]:8800/healthcheck level=debug msg="Beginning probe" probe=http timeout_seconds=2.5 [14:04:26] (I think new uploads => new thumbs => more thumbor load?) [14:05:06] we've had bigger peaks in the last 6 months, though, so perhaps a red herring [14:05:22] Emperor: yeah the => implications are correct [14:06:55] so are we looking at higher-than-usual load, but overly-aggresive paging because the timeout isn't the 20s we were expecting? 
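[editor's note] The `timeout_seconds=2.5` in the blackbox-exporter debug line above is what you would expect if the exporter derives its probe deadline from the Prometheus scrape timeout instead of from the 20s set in the module. A minimal sketch of that arithmetic follows. It assumes the exporter takes the scrape timeout, subtracts a safety offset (assumed here to be 0.5s), and uses the module timeout only when that is the smaller of the two. The function name and parameters are hypothetical and for illustration only, not the exporter's API.

```python
def effective_probe_timeout(scrape_timeout_s: float,
                            module_timeout_s: float,
                            offset_s: float = 0.5) -> float:
    """Return the deadline a probe would actually run with.

    Assumption: the budget is the Prometheus scrape timeout minus a
    small offset, and the module timeout can only lower that budget,
    never raise it.
    """
    budget = scrape_timeout_s - offset_s
    # A positive module timeout shorter than the budget wins.
    if 0 < module_timeout_s < budget:
        return module_timeout_s
    # Otherwise the scrape-derived budget caps the probe.
    return budget


# Values from the discussion: 3s at the Prometheus job level,
# 20s in the blackbox module configuration.
print(effective_probe_timeout(3.0, 20.0))
```

Under these assumptions the 20s module setting has no effect until the job-level scrape timeout is raised above it.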
[14:08:21] indeed [14:13:40] yeah so we set 3s in prometheus at the job level, and that takes precedence over the blackbox configuration [14:19:45] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:42] (03CR) 10Volans: [C: 03+1] "LGTM, but I'd like someone elses too to go through it with more contex on what happen earlier today." [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:23:44] (03CR) 10Jbond: [C: 03+2] P:ssh::client: Add GSSAPIDelegateCredentials support to ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791567 (owner: 10Jbond) [14:27:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:34] <_joe_> to be clear: thumbor is not down, just slow [14:28:12] (03CR) 10Dzahn: "The comment "what happened earlier today" makes me curious. This is a reaction to a specific event? Was gerrit down or something?" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:28:16] <_joe_> one thing we could do as an emergency measure is to cancel all thumbnailrender jobs for these pdfs [14:28:43] (03CR) 10Jbond: [V: 03+1] O:gerrit: Pass rendered ports.conf config to httpd file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:30:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is not the ccorrect way to handle this, and could have backscatter effects. 
Please wait before touching httpd/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:31:16] (03PS1) 10Filippo Giunchedi: hieradata: temp disable paging for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/798706 [14:31:29] if anyone wants to stamp ^ [14:32:11] (03CR) 10Jbond: [V: 03+1] O:gerrit: Pass rendered ports.conf config to httpd file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:32:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:50] 👀 [14:35:26] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks - though we should capture the need to get the timeout honoured properly?" [puppet] - 10https://gerrit.wikimedia.org/r/798706 (owner: 10Filippo Giunchedi) [14:36:24] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:36:27] properly> err, I meant in a phab item or similar which is easier to track than a comment :) [14:38:00] Emperor: thank you, yes I'll attach a phab task too [14:38:31] (03PS1) 10Giuseppe Lavagetto: gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 [14:39:01] <_joe_> jbond: ^^ [14:39:11] (03CR) 10CI reject: [V: 04-1] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:39:25] <_joe_> I still need to remove the class from the role, sigh [14:39:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [14:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:52] (03PS2) 10Giuseppe Lavagetto: gerrit: properly 
handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 [14:40:14] (03PS2) 10Filippo Giunchedi: hieradata: temp disable paging for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/798706 (https://phabricator.wikimedia.org/T309107) [14:40:29] _joe_: i think you will need to lint:ignore the httpd class [14:40:31] (03CR) 10CI reject: [V: 04-1] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:40:33] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: temp disable paging for thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798706 (https://phabricator.wikimedia.org/T309107) (owner: 10Filippo Giunchedi) [14:41:10] (03Abandoned) 10Jbond: O:gerrit: Pass rendered ports.conf config to httpd file [puppet] - 10https://gerrit.wikimedia.org/r/798677 (owner: 10Jbond) [14:41:20] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35520/console" [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:41:23] <_joe_> jbond: yeah w/e [14:41:36] <_joe_> I mean the right thing to do is to do those things in a profile [14:41:45] yes i agree [14:42:05] (03PS1) 10Muehlenhoff: Allow new idp-test hosts in Ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/798709 (https://phabricator.wikimedia.org/T308214) [14:42:30] (03PS3) 10Jbond: gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:42:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:42:57] <_joe_> jbond: so the way the httpd module was written, the idea was that ports would be managed this way [14:43:06] <_joe_> if someone wanted something non-standard [14:43:31] ack ill leave httpd alone then and use this going forward [14:43:38] thx [14:44:05] <_joe_> We 
clearly need better docs :) [14:44:13] <_joe_> I'll merge this and see if it works [14:44:20] thanks [14:44:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] gerrit: properly handle ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/798707 (owner: 10Giuseppe Lavagetto) [14:44:48] docs are probably fine, i should have just been patient and waited for a review [14:46:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [14:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:47] <_joe_> change merged, gerrit works [14:49:56] great thanks [14:51:58] (03PS4) 10Jbond: P:aptrepo::private: update to use httpd listen_ports [puppet] - 10https://gerrit.wikimedia.org/r/798617 [14:52:26] (03CR) 10Jbond: P:aptrepo::private: update to use httpd listen_ports (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798617 (owner: 10Jbond) [14:57:49] (03PS1) 10Jbond: C:requesttracker: drop requesttracker::apache [puppet] - 10https://gerrit.wikimedia.org/r/798727 [14:59:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35522/console" [puppet] - 10https://gerrit.wikimedia.org/r/798727 (owner: 10Jbond) [14:59:45] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:requesttracker: drop requesttracker::apache [puppet] - 10https://gerrit.wikimedia.org/r/798727 (owner: 10Jbond) [15:04:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 55.82 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:05:54] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 56.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:07:11] 
(03PS1) 10Jbond: C:reprepro: ensure /var/lib/reprepro/.bashrc exists [puppet] - 10https://gerrit.wikimedia.org/r/798740 [15:09:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:09:57] seems like some unusual ripples that legitimately trip those, but so far doesn't look really out of whack, either, may just be "normal" -ish external variation from heavier clients [15:10:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [15:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:50] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798709 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [15:10:58] ulsfo in particular, and I remember some thumbor load mentioned earlier? could be related (ulsfo tends to take some of the big tech company traffic, and they tend to do thumbory things sometimes?) 
[15:12:18] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 77.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:12:30] <_joe_> bblack: no, thumbor is self-inflicted pain [15:12:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 86.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:12:58] <_joe_> someone uploads a 1k pages pdf and we generate 4k thumbnails [15:13:02] <_joe_> one page at a time [15:13:08] <_joe_> so we render that pdf 4k times [15:15:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [15:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/798740 (owner: 10Jbond) [15:17:49] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10MatthewVernon) There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit `^r` during boot), the general BIOS setup... [15:20:08] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 4 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [15:20:41] uh? 
[15:21:42] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [15:21:45] vgutierrez@acmechief1001:~$ sudo -i openssl x509 -dates -noout -in /var/lib/acme-chief/certs/gerrit/live/rsa-2048.crt [15:21:45] notBefore=Apr 28 20:28:01 2022 GMT [15:21:45] notAfter=Jul 27 20:28:00 2022 GMT [15:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:12] (03PS1) 10Jbond: C:cfssl: create a refresh only resource to force resigns [puppet] - 10https://gerrit.wikimedia.org/r/798765 [15:22:15] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:22:16] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35523/console" [puppet] - 10https://gerrit.wikimedia.org/r/798765 (owner: 10Jbond) [15:23:16] _joe_: manual reload of httpd? [15:23:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:cfssl: create a refresh only resource to force resigns [puppet] - 10https://gerrit.wikimedia.org/r/798765 (owner: 10Jbond) [15:23:28] <_joe_> vgutierrez: what? 
[15:23:44] <_joe_> no I did not [15:23:57] _joe_: I saw you logged on to gerrit1001 and assumed that you fixed it [15:24:11] <_joe_> vgutierrez: nope [15:24:26] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 56.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:24:30] <_joe_> I'm not even sure that check is performed against gerrit1001 directly [15:25:31] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@644075e]: increase executor jvm heap for convert_to_esbulk [15:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:38] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 72.64 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:27:42] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:53] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@644075e]: increase executor jvm heap for convert_to_esbulk (duration: 02m 22s) [15:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:33] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) I've updated Netbox running the following code: `lang=python >>> import uuid >>> request_id = uuid.uui... 
[15:29:07] vgutierrez: _joe_: the ports.conf change would have caused an apache reload on gerrit [15:29:25] <_joe_> jbond: yeah but I ran that like 30 minutes ago [15:29:30] <_joe_> puppet, I mean [15:29:52] ack and also doesn't explain why it would fail then recover [15:29:57] <_joe_> yeah [15:30:01] (03PS1) 10Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour [puppet] - 10https://gerrit.wikimedia.org/r/798772 [15:30:24] <_joe_> vgutierrez: can you check what's the actual check performed by icinga? [15:30:34] gerrit.wm.o [15:30:36] that's the hostname [15:30:52] <_joe_> yeah I mean the whole command [15:30:56] <_joe_> check_httpd? [15:30:58] check_https_expiry!gerrit.wikimedia.org!443 [15:31:30] <_joe_> err the whole command line is not that :) [15:31:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [15:31:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:22] _joe_: check_http --ssl --sni --certificate 9,7 -I $HOSTADDRESS$ -H gerrit.wikimedia.org -p 443 [15:32:41] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10Volans) There is a 5th way and is via Redfish API ;) We do have basic support for redfish API in spicerack right now and there is plan to add suppor... 
[15:32:45] <_joe_> and hostaddress is I guess gerrit.wikimedia.org [15:32:49] yep [15:33:49] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test.py: increase timeout for DNS check [puppet] - 10https://gerrit.wikimedia.org/r/795730 (https://phabricator.wikimedia.org/T305909) (owner: 10Andrew Bogott) [15:35:57] (03PS5) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) [15:36:02] (03PS6) 10Volans: Duplicate names by design: add zone validator ignore [dns] - 10https://gerrit.wikimedia.org/r/793728 (https://phabricator.wikimedia.org/T155761) [15:37:12] (03PS1) 10Jbond: C:netbox: Add discovery namer as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777 [15:38:22] (03CR) 10Jbond: [C: 03+2] C:reprepro: ensure /var/lib/reprepro/.bashrc exists [puppet] - 10https://gerrit.wikimedia.org/r/798740 (owner: 10Jbond) [15:39:35] so puppet reloaded apache2 on gerrit1001 at 14:48 and the alert was triggered at 15:20 [15:40:31] (03CR) 10David Caro: [C: 03+1] OpenStack nova.conf: set reclaim_instance_interval to half an hour (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798772 (owner: 10Andrew Bogott) [15:40:42] (03PS2) 10Jbond: C:netbox: Add discovery namer as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777 [15:41:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35525/console" [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond) [15:42:52] vgutierrez: that's been flapping for days [15:43:16] (03PS3) 10Jbond: C:netbox: Add discovery name as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777 [15:43:25] RhinosF1: uh? 
that's interesting [15:43:25] I assumed it was the bug where apache serves old + new cert until restart [15:43:40] vgutierrez: there's a task [15:44:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35526/console" [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond) [15:44:25] vgutierrez: https://phabricator.wikimedia.org/T308908#7946277 [15:46:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:netbox: Add discovery name as apache alias [puppet] - 10https://gerrit.wikimedia.org/r/798777 (owner: 10Jbond) [15:47:34] 10SRE, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Joe) [15:49:06] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:49:35] so httpd has some worker lurking around up to 1 month without being killed? [15:49:49] that's pretty bad and not only for TLS material purposes [15:50:16] acme-chief reissues certificates one month before the current one expires [15:50:42] 10SRE, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle) [15:50:49] <_joe_> vgutierrez: I don't think that's it tbh [15:50:57] and gerrit1001 got the new one on April 28th, lrwxrwxrwx 1 root root 54 Apr 28 21:33 live -> /etc/acmecerts/gerrit/3a11664f5fdd45f48b53bd646c3bda1e [15:51:13] vgutierrez: legokt.m links the upstream bug I believe on the task. [15:54:56] vgutierrez: we don't set MaxConnectionsPerChild? [15:55:30] 10SRE, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10Vgutierrez) that's intended, every time that acme-chief fetches fresh OCSP stapling responses it issues a reload of apache2 >>! 
In T293826#7446839, @Legoktm wrote... [15:56:07] volans: nope apparently [15:56:24] I wish there was also a MaxDaysPerChild :D [15:56:47] actually [15:56:49] for low traffic or secondary hosts for example [15:56:50] mods-enabled/mpm_event.conf: MaxConnectionsPerChild 0 [15:57:31] there you go, immortal :D [15:58:09] Is the idea that the lurking old worker is delivering an old copy of the cert? [15:58:31] (03CR) 10Btullis: "Hi, sorry that I'm late to the party here." [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:59:00] RECOVERY - Ganeti memory on ganeti2030 is OK: OK Memory 67% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:59:04] dancy: yes [15:59:33] the old version that expires on May 28th [15:59:58] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:04] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:50] 10SRE, 10MediaWiki-File-management, 10MediaWiki-General, 10Wikimedia-production-error: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Joe) [16:03:13] (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:07:27] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10cscott) [16:11:26] (03PS1) 10C. Scott Ananian: CONTRIBUTORS: Add C. 
Scott Ananian [puppet] - 10https://gerrit.wikimedia.org/r/798800 (https://phabricator.wikimedia.org/T308013) [16:12:09] (03PS2) 10Zabe: tmpreaper: Remove args.erb [puppet] - 10https://gerrit.wikimedia.org/r/797362 [16:15:11] (03CR) 10JHathaway: [C: 03+2] dumps: remove generic python 2.25.1 user agent block [puppet] - 10https://gerrit.wikimedia.org/r/793550 (owner: 10JHathaway) [16:17:57] 10SRE, 10MediaWiki-File-management, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle) [16:18:06] 10SRE, 10MediaWiki-Uploading, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: LocalFile::prerenderThumbnail should have a page limit - https://phabricator.wikimedia.org/T309114 (10Krinkle) [16:28:54] (03PS2) 10KartikMistry: Enable Content and Section Translation in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797977 (https://phabricator.wikimedia.org/T304858) [16:32:12] (03PS1) 10BBlack: ntp.drmrs should use dns6001 [dns] - 10https://gerrit.wikimedia.org/r/798856 [16:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28455 and previous config saved to /var/cache/conftool/dbconfig/20220524-163221-ladsgroup.json [16:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:33:02] !log gitlab1003 - restarting rsync, trying to debug mysterious "rsync - read-only file system" error we ran into before but could not reproduce [16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:30] That sounds scary [16:38:43] (03PS2) 10Cathal Mooney: Modifications to install server netboot.cfg ommited in previous change [puppet] - 
10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) [16:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:42:21] (03CR) 10Cathal Mooney: [C: 03+2] Modifications to install server netboot.cfg ommited in previous change [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [16:45:10] !log aokoth@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28456 and previous config saved to /var/cache/conftool/dbconfig/20220524-164726-ladsgroup.json [16:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:53] (03CR) 10David Caro: "Thanks! LGTM, can you run pcc on it before getting it merged?" [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:48:23] (03CR) 10David Caro: [C: 03+1] "To be merge after the previous one has run a few times right?" [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:50:03] 10Puppet, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10ORES: Restructure ORES labs redis puppet role - https://phabricator.wikimedia.org/T281495 (10elukey) 05Open→03Resolved a:03elukey This has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/785111 in theory, closing... 
[16:50:04] !log gitlab1003 (gitlab-replica-new) - rebooting for fsck - T307142 [16:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:11] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [16:50:41] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab1003.wikimedia.org with reason: fsck [16:50:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab1003.wikimedia.org with reason: fsck [16:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:24] (03PS1) 10BryanDavis: base: remove "managed by puppet" notice on /etc/skel/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/798874 [16:59:55] (03CR) 10Zabe: acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:00:49] (03CR) 10David Caro: "LGTM, I'll wait for Jbond to do the final ack and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [17:00:56] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [17:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28457 and previous config saved to /var/cache/conftool/dbconfig/20220524-170231-ladsgroup.json [17:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:12] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [17:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:25] (03PS1) 10Ladsgroup: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808 [17:06:47] (03PS1) 10Ladsgroup: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809 [17:09:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [17:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:52] jouncebot: nowandnext [17:11:52] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [17:11:52] In 0 hour(s) and 48 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T1800) [17:12:13] (03CR) 10Ladsgroup: [C: 03+2] Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808 (owner: 10Ladsgroup) [17:12:15] (03CR) 10Clare Ming: [C: 03+1] mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: 10Jdlrobson) [17:12:19] (03CR) 10Ladsgroup: [C: 03+2] Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809 (owner: 10Ladsgroup) [17:14:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [17:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:41] (03PS1) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 [17:14:58] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks, merging." [puppet] - 10https://gerrit.wikimedia.org/r/798800 (https://phabricator.wikimedia.org/T308013) (owner: 10C. Scott Ananian) [17:16:30] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:16:42] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:17:32] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 3 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:17:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298560)', diff saved to https://phabricator.wikimedia.org/P28459 and previous config saved to /var/cache/conftool/dbconfig/20220524-171736-ladsgroup.json [17:17:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:17:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:17:43] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [17:17:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [17:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:48] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:49] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bullseye [17:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [17:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:09] (03CR) 10David Caro: [C: 03+1] acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:18:22] !log failover ganeti master in codfw to ganeti2022 [17:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:37] (03PS1) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [17:21:53] !log aokoth@cumin1001 START - 
Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [17:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:34] PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [17:23:21] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [17:23:45] !log gitlab1003 - short downtime for maintenance [17:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:20] (03PS2) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [17:25:32] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) >>! In T308940#7951736, @Dzahn wrote: > https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting "A flood of API traffic from an... 
[17:31:07] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/35531/" [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [17:32:03] (03Merged) 10jenkins-bot: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798808 (owner: 10Ladsgroup) [17:32:09] (03Merged) 10jenkins-bot: Revert "ApiQueryBacklinksprop: Completely remove index hints" [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798809 (owner: 10Ladsgroup) [17:32:38] (03CR) 10Cwhite: [C: 03+2] logstash: add target index validation step [puppet] - 10https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [17:33:27] 10ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (10RobH) All servers in B12 fixed and the power draw went from 8.9/1.9 to 5.6/5.5 so disabling the 'hot spare' option and splitting the load evenly ends up saving power. Going to give it a bit out of paranoia and then a... 
[17:35:42] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/api/ApiQueryBacklinksprop.php: Backport: [[gerrit:798808|Revert "ApiQueryBacklinksprop: Completely remove index hints"]] (duration: 00m 50s) [17:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:49] (03PS1) 10Cathal Mooney: Add v6 reverse zone for Vlan1116 / cloudsw1-c8 to cloudsw1-d5 linknet [dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936) [17:36:00] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:00] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [17:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:58] (KubernetesRsyslogDown) firing: (10) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:39:36] 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH) [17:39:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:44] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [17:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:40:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:16] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:14] 10ops-drmrs: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH) [17:50:45] 10ops-drmrs, 10Traffic-Icebox: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH) This is a dns server, so we'll have to check with traffic before we go taking it offline for repair. [17:50:59] 10ops-drmrs, 10Traffic: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH) [17:52:27] 10ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (10RobH) 05Open→03Resolved All hosts except dns6002 have been fixed. 
T309124 filed for dns6002 repair [17:53:03] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [17:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:52] Starting train operations [17:59:55] 10ops-drmrs, 10Traffic: dns6002 https idrac produces 400 error - https://phabricator.wikimedia.org/T309124 (10RobH) 05Open→03Resolved fixed the power via the idrac ssh cli [17:59:59] 10ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (10RobH) [18:00:11] (03CR) 10BBlack: [C: 03+2] ntp.drmrs should use dns6001 [dns] - 10https://gerrit.wikimedia.org/r/798856 (owner: 10BBlack) [18:02:14] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.39.0-wmf.13 refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915 [18:02:16] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.39.0-wmf.13 refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915 (owner: 10Ahmon Dancy) [18:02:57] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.13 refs T305219 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798915 (owner: 10Ahmon Dancy) [18:03:54] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.13 refs T305219 [18:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:01] T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 [18:06:11] (03CR) 10BBlack: [C: 03+2] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [18:06:17] (03PS2) 10BBlack: Add v6 reverse zone for Vlan1116 / cloudsw1-c8 to cloudsw1-d5 linknet [dns] - 10https://gerrit.wikimedia.org/r/798893 (https://phabricator.wikimedia.org/T304936) (owner: 10Cathal Mooney) [18:06:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:49] 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH) It fixed itself with reboot ` Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2., Normal,Tue 24 May 2022 18:06:22,The self-h... [18:07:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:07:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:36] 10SRE, 10ops-drmrs, 10DC-Ops, 10Traffic: hw troubleshooting: cp6006 b2 dimm issue - https://phabricator.wikimedia.org/T309123 (10RobH) [18:08:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:17] (03CR) 10Jbond: [C: 03+1] "thanks this looks really good, have let some minor nits but no blockers will merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [18:15:25] (03CR) 10Jbond: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/798874 (owner: 10BryanDavis) [18:18:02] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:33:43] (03CR) 10BBlack: [WIP] esitest service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [18:34:12] (03PS5) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) [18:36:07] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:44] !log deploying analytics refinery as part of the weekly deployment [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:09] !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31]: Regular analytics weekly train [analytics/refinery@8314d31] [18:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:52] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) While I was out, they closed the task and I had to re-open. They will be sending a new motherboard was where it left off. New ticket Successfully Submitted Case Number: 5... [18:41:15] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:41:57] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10Cmjohnson) @wiki_willy @Jclark-ctr We do not have a spare on-site. 
[18:42:47] (03PS2) 10Ryan Kemper: elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547 [18:44:36] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10Traffic, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10Krinkle) [18:44:51] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:44:59] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:45:05] (03PS2) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [18:45:29] (03CR) 10Gehel: elastic: add reimage to rolling-operation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [18:45:39] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.13 refs T305219 (duration: 41m 45s) [18:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:45] T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 [18:46:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:46:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:07] (03PS6) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) [18:48:08] (03CR) 10CI reject: [V: 04-1] elastic: add reimage to rolling-operation 
[cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [18:48:11] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.9, 1.39.0-wmf.8, 1.39.0-wmf.10 (duration: 02m 28s) [18:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:07] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) a:03Krinkle [18:52:11] (03PS3) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [18:52:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle) [18:53:50] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) [18:53:53] 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle) [18:53:56] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) [18:54:06] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) [18:54:12] 10SRE, 10Performance-Team, 10Wikimedia-Site-requests, 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle) 
05duplicate→03Open [18:55:10] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Krinkle) >>! In T308893#7947393, @Alexey_Skripnik wrote: > User Vladis13 did a great job importing some public domain texts in Russian Wikisource... [18:55:17] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10Krinkle) [18:59:54] (03CR) 10CI reject: [V: 04-1] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:01:49] !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31]: Regular analytics weekly train [analytics/refinery@8314d31] (duration: 23m 40s) [19:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:09] (03PS4) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:04:46] (03PS1) 10Ahmon Dancy: group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219) [19:04:48] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [19:05:01] (03CR) 10Ryan Kemper: [C: 03+2] elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547 (owner: 10Ryan Kemper) [19:06:01] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798971 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [19:06:35] (03PS5) 10Gehel: elastic: add reimage to rolling-operation [cookbooks] - 
10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:07:16] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.13 refs T305219 [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:23] T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 [19:09:03] (03PS6) 10Ryan Kemper: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:10:20] (03PS2) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 [19:12:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:20] (03PS11) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [19:13:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:13:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:25] !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31] (thin): Regular analytics weekly train THIN [analytics/refinery@8314d31] [19:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:31] (03CR) 10Ryan Kemper: elastic: add reimage to rolling-operation 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:14:33] !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31] (thin): Regular analytics weekly train THIN [analytics/refinery@8314d31] (duration: 00m 08s) [19:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:13] !log ebysans@deploy1002 Started deploy [analytics/refinery@8314d31] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8314d31] [19:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:47] (03PS7) 10Ryan Kemper: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:16:51] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [19:18:22] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10wiki_willy) Hi @BTullis - I noticed analytics1068 has a failed status and is set to be refreshed after @Cmjohnson finishes up T293922. As a quick fix, would we be able to pull the RA... [19:18:32] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:22] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10aaron) It would be good to look at the performance of pages at https://ru.wikisource.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1... 
[19:21:20] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:25] T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606 [19:22:14] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:34] !log ebysans@deploy1002 Finished deploy [analytics/refinery@8314d31] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8314d31] (duration: 07m 21s) [19:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:40] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:23:36] ^^ sry this was me adding new peerings here. 
[19:23:58] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [19:23:58] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:24:01] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:34] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:50] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:25] (03PS1) 10Ryan Kemper: elastic: rolling reimage is missing req os arg [cookbooks] - 
10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606) [19:28:13] (03CR) 10Bking: [V: 03+1] elastic: rolling reimage is missing req os arg [cookbooks] - 10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [19:28:35] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: rolling reimage is missing req os arg [cookbooks] - 10https://gerrit.wikimedia.org/r/798973 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [19:29:17] (03PS1) 10Cwhite: beta-logs: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798974 (https://phabricator.wikimedia.org/T305175) [19:29:37] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@3ae51e7]: (no justification provided) [19:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:45] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3ae51e7]: (no justification provided) (duration: 00m 08s) [19:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:15] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:21] T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606 [19:31:03] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:22] PROBLEM - IPMI Sensor Status on aqs1014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS 
Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:34:18] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:39:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:52] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:42:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:42:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:16] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [19:42:16] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:10] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [19:43:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:16] T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606 [19:46:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:46:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:02] (03PS1) 10Bartosz Dziewoński: Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068) [19:47:31] (03PS4) 10Bartosz Dziewoński: Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders) [19:48:09] (03CR) 10Bartosz Dziewoński: [C: 03+1] Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders) [19:49:27] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1003.eqiad.wmnet with OS bullseye [19:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:38] (03PS1) 10Bartosz Dziewoński: Update beta cluster DiscussionTools A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030) [19:53:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:43] (03PS1) 10Eevans: Allow `LOGIN` for image_suggestions Cassandra user [puppet] - 10https://gerrit.wikimedia.org/r/798977 [19:54:02] (03CR) 10Cwhite: [C: 03+2] beta-logs: enable pipeline-managed 
index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798974 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [19:54:31] !Log Refinery Deployment is complete [19:55:14] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:55:26] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:55:37] (03CR) 10Eevans: [C: 04-1] "I'm marking this -1 until we've established that this fixes the current connection failures. If it does we can merge it (it's already bee" [puppet] - 10https://gerrit.wikimedia.org/r/798977 (owner: 10Eevans) [19:55:44] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:59:33] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage [19:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, and cjming: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220524T2000). [20:00:04] Tran, zabe, cjming, koi, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] heya o/
[20:00:44] 👋 I'm here!
[20:00:51] hi
[20:00:53] here/
[20:01:25] hi all - i can deploy
[20:01:35] if anyone can/wants to self-serve, just lmk
[20:02:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage
[20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:33] Tran: I'll start with your patches
[20:02:34] SRE, Wikimedia-Site-requests, Performance-Team (Radar), Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (Alexey_Skripnik) This is one of the longest pagest in Russian Wikisource: https://ru.wikisource.org/wiki/%D0%A4%D0%B8%D0%BD%D0%B8%...
[20:02:47] :+1
[20:02:50] (CR) Clare Ming: [C: +2] Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[20:03:01] (PS1) Ryan Kemper: elastic: log return value of reimage cookbook [cookbooks] - https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606)
[20:03:50] (Merged) jenkins-bot: Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[20:04:04] ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:04:04] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:04:04] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) Ryan Kemper https://phabricator.wikimedia.org/T308606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:04:59] Tran: I think bec your 1st patch is labs, I can move on with your 2nd? When confirming rebase on master, it says it breaks the relation chain but i'm assuming that's ok?
[20:05:02] (PS2) Ryan Kemper: elastic: log return value of reimage cookbook [cookbooks] - https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606)
[20:05:45] Yes that's fine. The first two are just comments.
[20:05:54] (PS2) Clare Ming: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[20:06:25] SRE, Wikimedia-Site-requests, Performance-Team (Radar), Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (Alexey_Skripnik) In comparison, this is some **random short page** from Russian Wikisource: https://ru.wikisource.org/wiki/%D0%95%...
[20:07:15] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:07:43] seeing general slowness, one report of "upstream connect error or disconnect/reset before headers.
reset reason: overflow"
[20:08:11] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[20:08:14] (PS1) Cwhite: logstash: curator support new and legacy index patterns [puppet] - https://gerrit.wikimedia.org/r/798982 (https://phabricator.wikimedia.org/T305175)
[20:08:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:08:18] (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:29] looking
[20:08:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:08:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:08:55] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:793848|Remove
outdated comment about IPInfo from CommonSettings-labs.php (T308876)]] (duration: 00m 49s) [20:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:08:56] cjming: pause deploying please [20:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] T308876: Improve comments in mediawiki-config about IPInfo - https://phabricator.wikimedia.org/T308876 [20:09:03] rzl: ok [20:09:12] (03CR) 10Bking: [C: 03+1] elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [20:09:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:20] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:09:23] (03CR) 10Ryan Kemper: [C: 03+2] elastic: log return value of reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/798981 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [20:09:32] cjming not sure if this is deployment related or not, checking, but if you rolled anything out in the last few minutes, please prepare a rollback and don't merge it yet [20:09:34] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:09:47] * jbond here i can ic will set up the doc [20:09:54] jbond: ack,thanks [20:10:19] rzl: i deployed the first patch is all https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793848 [20:10:21] here too [20:10:28] here as well [20:10:48] rzl: should it be rolled back? [20:10:52] the only deployed patch is labs-only. It doesn't seem like it could have caused this. [20:10:59] looks like a spike of DB queries to s5 that saturated php-fpm workers, seems like it's already cleared [20:11:09] that's what I was thinking - i.e. labs only [20:11:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:11:16] cjming: yeah, safe to assume that's unrelated, just sit tight for a minute, thank you [20:11:23] rzl: sure thing [20:11:28] sorry, just ruling stuff out :D [20:11:42] rzl: i'll wait for your green light before proceeding [20:11:48] perfect, thanks, will let you know [20:12:53] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s5&var-role=All&from=now-3h&to=now s5 did see a traffic spike but recovered, still digging [20:13:18] (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:13:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:13:44] aside from current s5, note also that s5 replag has been growing since ~4.5h ago, not sure if that's a problem or related [20:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:13:56] https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy?orgId=1&viewPanel=6 [20:14:19] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:34] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:15:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:50] rzl: do we know more than "issue with S5" [20:16:53] jbond: issue driven by about a 6x spike in qps, and cwhite has some information in the other channel, but that's where we're at as far as I know [20:17:06] thx [20:18:45] https://orchestrator.wikimedia.org/web/cluster/alias/s5 shows high replication lag to db1154 but I think it's still depooled [20:19:01] ^ any SRE with a pair of hands free, can you verify that please? 
[20:19:23] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:19:46] I lag to dbstore1003 and to codfw also but I'm not worried about that right now [20:20:22] it got downtimed for 2 days [20:20:52] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1003.eqiad.wmnet with OS bullseye [20:20:54] I think Amir is running a schema change on them [20:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:10] T298560 [20:21:11] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:21:13] rzl: that db is still depooled [20:21:17] zabe: yes :) what I need to find out is whether it's still depooled but thank you [20:21:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:22] jbond: rad thanks [20:22:09] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:26:25] cjming: fyi we've switched channels but still digging, haven't forgotten you :) it looks like we're stable but we'd like to get a better sense of what's going on before we unblock deploys, will still let you know [20:26:31] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, 
relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:26:31] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:26:46] rzl: sounds good - i'll be standing by [20:36:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [20:38:25] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:51] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 122 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 152, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 120, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [20:40:51] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 55.47445255474452 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:42:19] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:44:15] 
RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:55] cjming: all clear, thanks for your patience! [20:52:11] rzl: thanks! [20:52:25] (03CR) 10Clare Ming: [C: 03+2] Add comment to consult Legal before updating IPInfo access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [20:52:43] <_joe_> cjming: sorry for the wait [20:52:53] Tran: I'm going to sync your 2nd patch since it's just a comment as well [20:53:03] _joe_: np! glad it all got sorted out [20:53:06] 👍 thanks! [20:53:29] (03Merged) 10jenkins-bot: Add comment to consult Legal before updating IPInfo access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [20:53:32] <_joe_> we were trying to be sure of the root cause so that if the problem happens again we won't get in your way :) [20:54:20] gtk we're all in good hands [20:54:44] (03PS2) 10Clare Ming: Deploy IPInfo to all wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: 10Tchanders) [20:54:48] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:793849|Add comment to consult Legal before updating IPInfo access (T308876)]] (duration: 00m 52s) [20:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:55] T308876: Improve comments in mediawiki-config about IPInfo - https://phabricator.wikimedia.org/T308876 [20:55:47] (03CR) 10Clare Ming: [C: 03+2] Deploy IPInfo to all wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: 10Tchanders) [20:56:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:56:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:21] (03Merged) 10jenkins-bot: Deploy IPInfo to all wikis by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: 10Tchanders) [20:57:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:57:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:57] Tran: is your 3rd patch something that can be checked? on mwdebug1001 [20:58:03] otherwise I can just sync [20:58:11] Yes I think I can check the version page to see if it's installed. Please hold [20:58:15] PROBLEM - Host ml-serve1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:32] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) Hi Willy, That works for me. Shall I shut down analytics1068 at a convenient time tomorrow? 
Many thanks, Ben [20:58:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:24] Yes I can confirm it's installed on mwdebug1001 [20:59:29] cool - syncing [20:59:56] (03PS5) 10Clare Ming: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:00:28] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793841|Deploy IPInfo to all wikis by default (T260597)]] (duration: 00m 52s) [21:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:34] T260597: Deploy IP Info extension to all wikis (as a beta feature) - https://phabricator.wikimedia.org/T260597 [21:00:36] Tran: should be live [21:00:47] Looks good thank you! [21:00:53] np! [21:01:00] Zabe: I can do yours next if you're still around [21:01:14] i am still here [21:01:19] (03CR) 10Clare Ming: [C: 03+2] Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:02:09] RECOVERY - Host ml-serve1007 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [21:02:15] (03CR) 10Clare Ming: [C: 03+2] mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: 10Jdlrobson) [21:02:53] (03Merged) 10jenkins-bot: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:03:24] Zabe: is your patch testable? 
on mwdebug1001 [21:03:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:04:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:04:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:07] cjming, it's not really testable. I made sure editing does not result in fatals. I will keep an eye on logstash after you synced it. 
[21:05:16] sounds good - syncing then [21:05:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:52] koi: doing my patch real quick and will do yours next if you're still around [21:06:10] still here [21:06:18] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797294|Start writing to cuc_actor in s3, kcgwiki and labtestwiki (T233004)]] (duration: 00m 52s) [21:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:23] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:06:24] Zabe: should be live [21:06:41] ok, thanks :) [21:07:58] (KubernetesRsyslogDown) firing: (9) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:08:53] (03PS2) 10Clare Ming: zhwikisource: Adjust workmark size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [21:10:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:14:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:13] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10aaron) It would be good to look at the performance of pages at https://he.wikisource.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%93%D7%A4%D7%99%... [21:15:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:52] (03PS1) 10Catrope: CONTRIBUTORS: Add myself (Roan Kattouw) [puppet] - 10https://gerrit.wikimedia.org/r/798991 (https://phabricator.wikimedia.org/T308013) [21:21:44] koi: sorry - my patch is taking forever to merge -- it's almost there [21:22:03] (03Merged) 10jenkins-bot: mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) (owner: 10Jdlrobson) [21:23:35] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.12/resources/src/mediawiki.skinning/accessibility.less: Backport: [[gerrit:797219|mediawiki.skinning: `transition-duration` accessibility override set to `0` (T308979)]] (duration: 00m 51s) [21:23:37] (03CR) 10Clare Ming: [C: 03+2] zhwikisource: Adjust workmark size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:42] T308979: Infinite motion when "Reduces motion" is enabled on mobile device for skins that are not responsive (Modern, Vector legacy) - https://phabricator.wikimedia.org/T308979 [21:24:30] (03Merged) 10jenkins-bot: zhwikisource: Adjust workmark size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) (owner: 
10Stang) [21:24:59] koi: can you check mwdebug1001? [21:25:07] looking [21:25:33] cjming: LGTM [21:25:39] great - syncing [21:26:41] !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikisource-wordmark-zh.svg: Config: [[gerrit:792971|zhwikisource: Adjust workmark size (T308620)]] (duration: 00m 50s) [21:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:47] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [21:27:20] Column 'cuc_actor' cannot be null [21:27:21] bah [21:27:53] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792971|zhwikisource: Adjust workmark size (T308620)]] (duration: 00m 50s) [21:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:14] koi: should be live - purged the svg [21:28:48] indeed, thx [21:29:26] MatmaRex: are you still around? happy to do your patches unless you'd like to self-serve [21:29:50] cjming: yeah, i'm around if you're still deploying [21:30:09] (i don't have deploy access) [21:30:14] sure - np [21:30:25] (03PS5) 10Clare Ming: Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders) [21:30:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:06] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10Jclark-ctr) @BTullis i am available tomorrow at 3pm est [21:31:41] (03CR) 10Clare Ming: [C: 03+2] Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders) [21:31:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [21:31:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:31:44] (03CR) 10Ori: [C: 03+2] CONTRIBUTORS: Add myself (Roan Kattouw) [puppet] - 10https://gerrit.wikimedia.org/r/798991 (https://phabricator.wikimedia.org/T308013) (owner: 10Catrope) [21:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:26] (03Merged) 10jenkins-bot: Disable autotopicsub user option by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) (owner: 10Esanders) [21:33:10] MatmaRex: your 1st patch is on mwdebug1001 if it's verifiable [21:33:46] (03CR) 10Clare Ming: [C: 03+2] Update beta cluster DiscussionTools A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030) (owner: 10Bartosz Dziewoński) [21:33:46] cjming: it should be a no-op [21:33:56] alrighty then - syncing [21:34:11] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reimage - ryankemper@cumin1001 - T308606 [21:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:16] T308606: Add reimage to Elastic rolling-operation cookbook - https://phabricator.wikimedia.org/T308606 [21:34:56] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771872|Disable autotopicsub user option by default (T297966)]] (duration: 00m 48s) [21:35:02] (03Merged) 10jenkins-bot: Update beta cluster DiscussionTools A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798976 (https://phabricator.wikimedia.org/T304030) (owner: 10Bartosz Dziewoński) [21:35:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:06] T297966: Auto topic subscription should be enabled by default on 3rd party installs - https://phabricator.wikimedia.org/T297966 [21:35:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:35:57] (03CR) 10Clare Ming: [C: 03+2] Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068) (owner: 10Bartosz Dziewoński) [21:36:37] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:798976|Update beta cluster DiscussionTools A/B test config (T304030)]] (duration: 00m 49s) [21:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:44] T304030: Implement Topic Subscriptions A/B test bucketing - https://phabricator.wikimedia.org/T304030 [21:37:01] MatmaRex: just waiting for your last patch to merge [21:37:44] thanks [21:38:54] cjming, sorry, but I need to revert my patch, it's causing fatals [21:39:09] zabe: ok [21:40:03] (03PS1) 10Zabe: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) [21:40:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:41:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[21:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:50] (03PS2) 10Clare Ming: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:46:19] zabe: just waiting for the current patch to go thru and we can do your revert [21:46:29] ok :) [21:52:13] so what is taking 15 minutes there [21:52:20] oh, selenium tests [21:52:49] ya - so slow [21:53:34] (03Merged) 10jenkins-bot: Follow-up I97c27fd7: Fix after-edit reload in source editor [extensions/MobileFrontend] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798811 (https://phabricator.wikimedia.org/T309068) (owner: 10Bartosz Dziewoński) [21:54:13] apparently it takes 2 minutes just to install the npm dependencies for them. truly we're doomed [21:54:24] lol [21:54:40] MatmaRex: your last patch is on mwdebug1001 if you can confirm [21:54:56] yeah. testing [21:55:19] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 51420 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [21:55:30] awww. 
man [21:55:37] looks fixed at https://m.mediawiki.org/wiki/Project:Sandbox (i did a null edit) [21:55:45] cool - syncing then [21:56:50] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/MobileFrontend: Backport: [[gerrit:798811|Follow-up I97c27fd7: Fix after-edit reload in source editor (T309068)]] (duration: 00m 48s) [21:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:56] MatmaRex: should be live [21:56:57] T309068: [betalabs-mobile] Publishing edits from source editor re-opens page in editing mode - https://phabricator.wikimedia.org/T309068 [21:57:05] thank you cjming. have a good evening [21:57:10] thanks! you too [21:57:18] ok zabe: onto your revert [21:57:25] (03CR) 10Clare Ming: [C: 03+2] Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:57:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:41] (oh, still not done :( ) [21:58:20] MatmaRex: ? 
[21:58:28] (03Merged) 10jenkins-bot: Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/798813 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:58:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:58:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:56] cjming: sorry, everything's alright, i'm just bemoaning the deployment running over :) [21:59:15] appreciate the empathy lol [21:59:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:21] zabe: i can go ahead and sync [22:00:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Jclark-ctr) All host racked , powered and management. will update once network is completed name rack position mw1457 A8 1 mw1458 A8 2 mw1459 A8 3 mw1460 A8 12 mw14... 
[22:01:01] unless you want to verify on mwdebug1001 [22:01:19] gonna sync cuz i gotta run [22:02:14] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:798813|Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki" (T233004 T309148)]] (duration: 00m 49s) [22:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:22] T309148: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'cuc_actor' cannot be nullFunction: MediaWiki\CheckUser\Hooks::updateCheckUserDataQuery: INSERT INTO `cu_changes` (cuc_namespace,cuc_title,cuc_minor,cuc_user,cuc_user_text,cuc_ - https://phabricator.wikimedia.org/T309148 [22:02:22] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:03:16] yes [22:03:33] thanks for helping with this :) [22:03:34] !log centrallog2002 - alerted because running out of disk. /srv/syslog# find . -name *.gz -mtime +100 -delete [22:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:17] zabe: np ! 
revert should be live [22:04:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:04:41] !log end of UTC late backport window [22:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:08:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:03] (03CR) 10Dzahn: "what Majavah says,"restricted" will give access to mediawiki::maintenance hosts and deployment isn't really needed afaict." [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [22:15:54] (03CR) 10Krinkle: Move out ORES extension configuration out of InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [22:19:00] (03CR) 10Dzahn: "looking at https://phabricator.wikimedia.org/T307452#7930485 if this is really just that command and it's running twice a week... would i" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [22:21:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) >>! In T309045#7950982, @MShilova_WMF wrote: > I confirm that @sgs needs access to a production server and it is currently blocking {https://phabric... 
[22:22:53] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 23.42 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:23:07] ^ OK [22:24:17] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 41.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:24:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 41.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:25:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:26:31] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 81.09 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:26:31] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 99.39 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:28:17] (PS1) Cwhite: logstash: enable pipeline-managed index patterns [puppet] - https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) [22:42:05] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:46] (CR) BryanDavis: [C: +2] Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [22:51:27] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 268, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma [22:51:27] g_in_queue_millis: 0, active_shards_percent_as_number: 97.8102189781022 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:52:19] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 137, active_shards: 273, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma [22:52:19] g_in_queue_millis: 0, active_shards_percent_as_number: 99.63503649635037 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:52:49] (Merged) jenkins-bot: Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [23:10:01] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:12:07] SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes -
https://phabricator.wikimedia.org/T283056 (Krinkle) [23:15:12] (PS1) MewOphaswongse: Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/799007 (https://phabricator.wikimedia.org/T309152) [23:15:28] (PS1) MewOphaswongse: Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) [23:40:09] (CR) CI reject: [V: -1] Add an image: Attach view image details button to .mw-ge-recommendedImage-imageWrapper [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/799008 (https://phabricator.wikimedia.org/T309152) (owner: MewOphaswongse) [23:43:13] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:44:56] (PS7) BryanDavis: helmfile.d: add developer-portal [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [23:55:42] (CR) BryanDavis: [C: +1] "I plan to merge and deploy this for the first time on 2022-05-25." [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah)