2021-02-15 01:07:53
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 01:13:05
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 02:13:17
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 02:18:25
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 02:31:55
|
<icinga-wm>
|
PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5870 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
|
2021-02-15 02:33:31
|
<icinga-wm>
|
RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
|
2021-02-15 02:42:29
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 02:47:15
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 03:08:29
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 03:13:39
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 03:34:49
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2021-02-15 03:35:57
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2021-02-15 04:43:13
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 04:48:23
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 05:02:27
|
<icinga-wm>
|
PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
|
2021-02-15 05:04:03
|
<icinga-wm>
|
RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
|
2021-02-15 06:02:28
|
<wikibugs>
|
'SRE, ''DBA: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (''Marostegui) p:''Triage→''Medium a:''Kormat Yeah, as far as I remember we're not using this for anything Assigning it for Stevie for confirmation and removal (if that applies)'
|
2021-02-15 06:10:23
|
<wikibugs>
|
'SRE, ''ops-eqiad, ''DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (''Marostegui) Thanks everyone who responded to this incident!'
|
2021-02-15 06:17:33
|
<wikibugs>
|
'SRE, ''DBA, ''Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (''Marostegui) >>! In T258361#6822070, @jcrespo wrote: > I am taking db1163 to, at least temporarily, substitute db1134 due to T274472. Thanks. I...'
|
2021-02-15 06:19:15
|
<wikibugs>
|
('PS1) ''Marostegui: db1162: Enable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361)'
|
2021-02-15 06:20:14
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] db1162: Enable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664087 (https://phabricator.wikimedia.org/T258361) (owner: ''Marostegui)'
|
2021-02-15 06:36:31
|
<wikibugs>
|
('PS1) ''Marostegui: instances.yaml: Add db1162 to dbctl [puppet] - ''https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361)'
|
2021-02-15 06:37:05
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] instances.yaml: Add db1162 to dbctl [puppet] - ''https://gerrit.wikimedia.org/r/664088 (https://phabricator.wikimedia.org/T258361) (owner: ''Marostegui)'
|
2021-02-15 06:40:02
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1162 to dbctl - depooled T258361', diff saved to https://phabricator.wikimedia.org/P14339 and previous config saved to /var/cache/conftool/dbconfig/20210215-064001-marostegui.json
|
2021-02-15 06:40:06
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 06:40:08
|
<stashbot>
|
T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
|
2021-02-15 06:46:28
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14340 and previous config saved to /var/cache/conftool/dbconfig/20210215-064628-marostegui.json
|
2021-02-15 06:46:32
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 06:46:33
|
<stashbot>
|
T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
|
2021-02-15 06:56:50
|
<wikibugs>
|
('PS1) ''Marostegui: install_server: Do not reimage db1162 and db1163 [puppet] - ''https://gerrit.wikimedia.org/r/664089'
|
2021-02-15 06:57:31
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] install_server: Do not reimage db1162 and db1163 [puppet] - ''https://gerrit.wikimedia.org/r/664089 (owner: ''Marostegui)'
|
2021-02-15 06:58:07
|
<icinga-wm>
|
RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 07:02:06
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14341 and previous config saved to /var/cache/conftool/dbconfig/20210215-070206-marostegui.json
|
2021-02-15 07:02:10
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:02:12
|
<stashbot>
|
T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
|
2021-02-15 07:09:46
|
<wikibugs>
|
'SRE, ''ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (''elukey) ''Resolved→''Open ms-be1034 is down again, same issue as the one described by Filippo... :('
|
2021-02-15 07:10:31
|
<icinga-wm>
|
ACKNOWLEDGEMENT - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T274488
|
2021-02-15 07:14:17
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
|
2021-02-15 07:14:20
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:16:37
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
|
2021-02-15 07:16:40
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:20:41
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet
|
2021-02-15 07:20:44
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:22:37
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet
|
2021-02-15 07:22:40
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:24:21
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet
|
2021-02-15 07:24:23
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:26:40
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet
|
2021-02-15 07:26:44
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:28:21
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
|
2021-02-15 07:28:26
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:33:24
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
|
2021-02-15 07:33:29
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:38:23
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 07:42:54
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
|
2021-02-15 07:42:57
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:43:33
|
<icinga-wm>
|
PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 07:47:21
|
<wikibugs>
|
('PS1) ''ArielGlenn: wikidata json dumps: re-add source of shared functions [puppet] - ''https://gerrit.wikimedia.org/r/664090'
|
2021-02-15 07:48:16
|
<wikibugs>
|
('CR) ''ArielGlenn: [C: ''+2] wikidata json dumps: re-add source of shared functions [puppet] - ''https://gerrit.wikimedia.org/r/664090 (owner: ''ArielGlenn)'
|
2021-02-15 07:49:32
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 3%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14342 and previous config saved to /var/cache/conftool/dbconfig/20210215-074932-root.json
|
2021-02-15 07:49:35
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 07:57:16
|
<wikibugs>
|
('PS1) ''ArielGlenn: now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - ''https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377)'
|
2021-02-15 08:04:37
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 4%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14343 and previous config saved to /var/cache/conftool/dbconfig/20210215-080435-root.json
|
2021-02-15 08:04:40
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 08:07:33
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 08:08:21
|
<wikibugs>
|
('CR) ''ArielGlenn: [C: ''+2] now that snapshot1005 is testbed host, make snapshot1007 the enwiki dumps runner [puppet] - ''https://gerrit.wikimedia.org/r/664091 (https://phabricator.wikimedia.org/T269377) (owner: ''ArielGlenn)'
|
2021-02-15 08:10:50
|
<wikibugs>
|
('PS1) ''ArielGlenn: prep snapshot1005 and 1006 for reinstall with buster [puppet] - ''https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377)'
|
2021-02-15 08:13:14
|
<wikibugs>
|
('CR) ''ArielGlenn: [C: ''+2] prep snapshot1005 and 1006 for reinstall with buster [puppet] - ''https://gerrit.wikimedia.org/r/664092 (https://phabricator.wikimedia.org/T269377) (owner: ''ArielGlenn)'
|
2021-02-15 08:17:33
|
<wikibugs>
|
'SRE, ''Dumps-Generation, ''Platform Engineering, ''serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1005.eqiad.wmnet ` The log can be fo...'
|
2021-02-15 08:19:41
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14344 and previous config saved to /var/cache/conftool/dbconfig/20210215-081940-root.json
|
2021-02-15 08:26:51
|
<wikibugs>
|
'SRE, ''DBA, ''Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (''Marostegui)'
|
2021-02-15 08:27:19
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 T274235', diff saved to https://phabricator.wikimedia.org/P14345 and previous config saved to /var/cache/conftool/dbconfig/20210215-082718-marostegui.json
|
2021-02-15 08:27:47
|
<icinga-wm>
|
PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 08:29:05
|
<gehel>
|
!log powercycle wdqs1009
|
2021-02-15 08:29:22
|
<wikibugs>
|
('PS1) ''Marostegui: db1075: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235)'
|
2021-02-15 08:29:24
|
<wikibugs>
|
('PS1) ''Elukey: profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - ''https://gerrit.wikimedia.org/r/664094'
|
2021-02-15 08:29:25
|
<logmsgbot>
|
!log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
|
2021-02-15 08:30:06
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] db1075: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664093 (https://phabricator.wikimedia.org/T274235) (owner: ''Marostegui)'
|
2021-02-15 08:31:30
|
<logmsgbot>
|
!log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1005.eqiad.wmnet with reason: REIMAGE
|
2021-02-15 08:31:48
|
<wikibugs>
|
('PS1) ''JMeybohm: tiller: Run tiller as user nobody [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254)'
|
2021-02-15 08:31:50
|
<wikibugs>
|
('PS1) ''JMeybohm: eventrouter: Use numeric UID [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254)'
|
2021-02-15 08:31:52
|
<wikibugs>
|
('PS1) ''JMeybohm: fluent-bit: Use numeric UID [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254)'
|
2021-02-15 08:31:57
|
<wikibugs>
|
('PS1) ''JMeybohm: ratelimit: Use numeric UID [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254)'
|
2021-02-15 08:32:20
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+2] profile::hadoop::backup::namenode: add a more precise notes_url [puppet] - ''https://gerrit.wikimedia.org/r/664094 (owner: ''Elukey)'
|
2021-02-15 08:34:44
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14346 and previous config saved to /var/cache/conftool/dbconfig/20210215-083444-root.json
|
2021-02-15 08:44:12
|
<wikibugs>
|
('PS1) ''Elukey: hadoop: enable HDFS service port for Analytics Hadoop [puppet] - ''https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629)'
|
2021-02-15 08:45:24
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] "Nice!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 08:47:53
|
<wikibugs>
|
('CR) ''Elukey: [V: ''+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28056/console" [puppet] - ''https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: ''Elukey)'
|
2021-02-15 08:48:01
|
<logmsgbot>
|
!log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
|
2021-02-15 08:49:48
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 15%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14347 and previous config saved to /var/cache/conftool/dbconfig/20210215-084947-root.json
|
2021-02-15 08:50:59
|
<wikibugs>
|
'ops-eqiad, ''DC-Ops, ''Wikidata, ''Wikidata-Query-Service: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (''Gehel)'
|
2021-02-15 08:53:53
|
<wikibugs>
|
'SRE, ''Dumps-Generation, ''Platform Engineering, ''serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1005.eqiad.wmnet'] ` and were **ALL** successful.'
|
2021-02-15 08:58:58
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
|
2021-02-15 09:01:22
|
<wikibugs>
|
'SRE, ''DBA, ''Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (''Marostegui)'
|
2021-02-15 09:01:30
|
<wikibugs>
|
'SRE, ''DBA, ''Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (''Marostegui)'
|
2021-02-15 09:01:32
|
<wikibugs>
|
('CR) ''Elukey: [V: ''+1 C: ''+2] hadoop: enable HDFS service port for Analytics Hadoop [puppet] - ''https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: ''Elukey)'
|
2021-02-15 09:04:52
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14348 and previous config saved to /var/cache/conftool/dbconfig/20210215-090451-root.json
|
2021-02-15 09:05:56
|
<wikibugs>
|
('PS1) ''Joal: Update oozie sharelib creation [puppet] - ''https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322)'
|
2021-02-15 09:06:00
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''-1] "You do mix list indention styles a bit, don't know if we should argue about it or just leave it be." (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/651757 (owner: ''Giuseppe Lavagetto)'
|
2021-02-15 09:06:03
|
<joal>
|
elukey: --^
|
2021-02-15 09:06:06
|
<joal>
|
for when you have time
|
2021-02-15 09:07:48
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+2] Update oozie sharelib creation [puppet] - ''https://gerrit.wikimedia.org/r/664172 (https://phabricator.wikimedia.org/T274322) (owner: ''Joal)'
|
2021-02-15 09:11:52
|
<logmsgbot>
|
!log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
|
2021-02-15 09:12:50
|
<wikibugs>
|
'SRE, ''Dumps-Generation, ''Platform Engineering, ''serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1006.eqiad.wmnet ` The log can be fo...'
|
2021-02-15 09:13:58
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: grafana: stop POST to /api/snapshots [puppet] - ''https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736)'
|
2021-02-15 09:15:13
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [V: ''+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28057/console" [puppet] - ''https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: ''Filippo Giunchedi)'
|
2021-02-15 09:15:53
|
<wikibugs>
|
('CR) ''Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 09:17:11
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C: ''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: ''Filippo Giunchedi)'
|
2021-02-15 09:17:49
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [V: ''+1 C: ''+2] grafana: stop POST to /api/snapshots [puppet] - ''https://gerrit.wikimedia.org/r/664224 (https://phabricator.wikimedia.org/T274736) (owner: ''Filippo Giunchedi)'
|
2021-02-15 09:19:55
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14349 and previous config saved to /var/cache/conftool/dbconfig/20210215-091955-root.json
|
2021-02-15 09:24:00
|
<wikibugs>
|
('PS1) ''ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - ''https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)'
|
2021-02-15 09:24:02
|
<wikibugs>
|
('CR) ''David Caro: "Got a couple questions, nits you can safely ignore :)" (''6 comments) [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 09:24:20
|
<wikibugs>
|
('PS2) ''JMeybohm: mathoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: ''PipelineBot)'
|
2021-02-15 09:24:39
|
<wikibugs>
|
('CR) ''Ayounsi: [C: ''+2] Remove sampling feature flag [homer/public] - ''https://gerrit.wikimedia.org/r/663533 (owner: ''Ayounsi)'
|
2021-02-15 09:25:47
|
<logmsgbot>
|
!log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
|
2021-02-15 09:26:49
|
<wikibugs>
|
('CR) ''Ayounsi: "confirmed NOOP." [homer/public] - ''https://gerrit.wikimedia.org/r/663533 (owner: ''Ayounsi)'
|
2021-02-15 09:27:52
|
<logmsgbot>
|
!log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1006.eqiad.wmnet with reason: REIMAGE
|
2021-02-15 09:28:41
|
<wikibugs>
|
('PS1) ''Vgutierrez: admin: Add christinedk user [puppet] - ''https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304)'
|
2021-02-15 09:28:43
|
<wikibugs>
|
('PS1) ''Vgutierrez: admin: Add christinedk to analytics-privatedata-users [puppet] - ''https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304)'
|
2021-02-15 09:34:59
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 30%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14350 and previous config saved to /var/cache/conftool/dbconfig/20210215-093458-root.json
|
2021-02-15 09:35:26
|
<wikibugs>
|
('CR) ''Muehlenhoff: admin: Add christinedk user (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: ''Vgutierrez)'
|
2021-02-15 09:37:27
|
<wikibugs>
|
'SRE, ''observability: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (''Volans) If we allow for normal reboots going unnoticed, would we catch a scenario in which the icinga host reboots every 5 minutes due to a bug or DoS? P.S. Keyholder is not armed aft...'
|
2021-02-15 09:43:50
|
<elukey>
|
!log roll restart HDFS daemons in Analytics Hadoop to pick up new RPC queue changes - T273629
|
2021-02-15 09:47:55
|
<wikibugs>
|
('CR) ''Volans: "Optional nit inline" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/663860 (owner: ''Hnowlan)'
|
2021-02-15 09:50:03
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 40%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14351 and previous config saved to /var/cache/conftool/dbconfig/20210215-095002-root.json
|
2021-02-15 09:50:41
|
<wikibugs>
|
'SRE, ''Dumps-Generation, ''Platform Engineering, ''serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1006.eqiad.wmnet'] ` and were **ALL** successful.'
|
2021-02-15 09:55:54
|
<wikibugs>
|
('PS1) ''Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - ''https://gerrit.wikimedia.org/r/663961'
|
2021-02-15 09:56:15
|
<wikibugs>
|
('PS2) ''Jcrespo: Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - ''https://gerrit.wikimedia.org/r/663961'
|
2021-02-15 09:57:14
|
<wikibugs>
|
'SRE, ''Dumps-Generation, ''Platform Engineering, ''serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (''ArielGlenn) I was not going to re-image snapshot1005 and 6 because their replacements were due to have come in, but the boxes have not arrived yet a...'
|
2021-02-15 09:57:18
|
<wikibugs>
|
'SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 09:57:34
|
<wikibugs>
|
('PS2) ''ArielGlenn: misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - ''https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377)'
|
2021-02-15 09:57:51
|
<wikibugs>
|
'SRE, ''Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 09:58:12
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+2] Revert "dbbackups: disable all ES db bacula runs until next week" [puppet] - ''https://gerrit.wikimedia.org/r/663961 (owner: ''Jcrespo)'
|
2021-02-15 09:59:02
|
<wikibugs>
|
('CR) ''ArielGlenn: [C: ''+2] misc dumps: move commons rdf to later on Sunday and media info to earlier [puppet] - ''https://gerrit.wikimedia.org/r/664225 (https://phabricator.wikimedia.org/T269377) (owner: ''ArielGlenn)'
|
2021-02-15 10:00:12
|
<apergos>
|
jynus: may I merge your puppet patch "backup::set { 'mysql-srv-backups-dumps-latest':" etc?
|
2021-02-15 10:00:17
|
<jynus>
|
yes
|
2021-02-15 10:00:41
|
<apergos>
|
done!
|
2021-02-15 10:00:44
|
<jynus>
|
thanks
|
2021-02-15 10:02:02
|
<apergos>
|
thanks for the quick response!
|
2021-02-15 10:05:06
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14352 and previous config saved to /var/cache/conftool/dbconfig/20210215-100505-root.json
|
2021-02-15 10:09:14
|
<hashar>
|
!log Switching Jenkins jobs to Quibble 0.0.46
|
2021-02-15 10:15:52
|
<wikibugs>
|
'SRE, ''ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (''fgiunchedi) Thank you for all the work ! LMK how I can help e.g. if speeding up the decom of one host in T272836 would help (as opposed as decom'ing all hosts at the same time)'
|
2021-02-15 10:20:09
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14353 and previous config saved to /var/cache/conftool/dbconfig/20210215-102009-root.json
|
2021-02-15 10:23:30
|
<logmsgbot>
|
!log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org
|
2021-02-15 10:27:29
|
<logmsgbot>
|
!log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org
|
2021-02-15 10:30:08
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] db1134: Do not be tag as candidate master [puppet] - ''https://gerrit.wikimedia.org/r/664230 (https://phabricator.wikimedia.org/T274472) (owner: ''Marostegui)'
|
2021-02-15 10:31:09
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports > 1024 [puppet] - ''https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)'
|
2021-02-15 10:35:13
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 70%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14355 and previous config saved to /var/cache/conftool/dbconfig/20210215-103512-root.json
|
2021-02-15 10:41:25
|
<wikibugs>
|
('PS2) ''Arturo Borrero Gonzalez: dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - ''https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397)'
|
2021-02-15 10:44:09
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024 [puppet] - ''https://gerrit.wikimedia.org/r/664231 (https://phabricator.wikimedia.org/T272397) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 10:47:08
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: labstore: allow NFS connections from public cloud networks [puppet] - ''https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397)'
|
2021-02-15 10:48:49
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] labstore: allow NFS connections from public cloud networks [puppet] - ''https://gerrit.wikimedia.org/r/664233 (https://phabricator.wikimedia.org/T272397) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 10:49:05
|
<wikibugs>
|
('PS1) ''ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - ''https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)'
|
2021-02-15 10:50:16
|
<godog>
|
jouncebot: next
|
2021-02-15 10:50:16
|
<jouncebot>
|
In 0 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130)
|
2021-02-15 10:50:16
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 80%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14356 and previous config saved to /var/cache/conftool/dbconfig/20210215-105016-root.json
|
2021-02-15 10:57:30
|
<wikibugs>
|
('PS2) ''ArielGlenn: swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - ''https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713)'
|
2021-02-15 10:57:59
|
<logmsgbot>
|
!log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet
|
2021-02-15 10:58:44
|
<wikibugs>
|
('PS1) ''Jcrespo: Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - ''https://gerrit.wikimedia.org/r/664237'
|
2021-02-15 10:58:59
|
<wikibugs>
|
('CR) ''ArielGlenn: [C: ''+2] swap roles of dumpsdata1001 and 1003 so 1003 is primary for xml/sql dumps [puppet] - ''https://gerrit.wikimedia.org/r/664234 (https://phabricator.wikimedia.org/T273713) (owner: ''ArielGlenn)'
|
2021-02-15 11:00:25
|
<logmsgbot>
|
!log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet
|
2021-02-15 11:02:02
|
<wikibugs>
|
('CR) ''Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/28058/" [puppet] - ''https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: ''Effie Mouzeli)'
|
2021-02-15 11:03:17
|
<wikibugs>
|
('PS2) ''Hnowlan: mtail: add exception handling in tests for non-Debian OSes [puppet] - ''https://gerrit.wikimedia.org/r/663860'
|
2021-02-15 11:05:20
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 90%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14357 and previous config saved to /var/cache/conftool/dbconfig/20210215-110519-root.json
|
2021-02-15 11:06:51
|
<icinga-wm>
|
RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
|
2021-02-15 11:07:27
|
<icinga-wm>
|
RECOVERY - tilerator on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
|
2021-02-15 11:08:21
|
<icinga-wm>
|
RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 11:10:25
|
<logmsgbot>
|
!log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps2007.codfw.wmnet
|
2021-02-15 11:11:57
|
<wikibugs>
|
('CR) ''Hnowlan: mtail: add exception handling in tests for non-Debian OSes (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/663860 (owner: ''Hnowlan)'
|
2021-02-15 11:14:57
|
<wikibugs>
|
('PS1) ''Elukey: profile::hadoop::master: raise threshold for corrupt blocks [puppet] - ''https://gerrit.wikimedia.org/r/664238'
|
2021-02-15 11:16:50
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+2] profile::hadoop::master: raise threshold for corrupt blocks [puppet] - ''https://gerrit.wikimedia.org/r/664238 (owner: ''Elukey)'
|
2021-02-15 11:20:24
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14358 and previous config saved to /var/cache/conftool/dbconfig/20210215-112023-root.json
|
2021-02-15 11:27:16
|
<wikibugs>
|
('PS4) ''Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - ''https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)'
|
2021-02-15 11:28:11
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes - elukey@cumin1001
|
2021-02-15 11:28:44
|
<elukey>
|
this may trigger (I hope not) AQS alerts --^
|
2021-02-15 11:28:52
|
<elukey>
|
in case it is my fault and you can blame me
|
2021-02-15 11:29:05
|
<elukey>
|
sees kormat ready for it
|
2021-02-15 11:29:31
|
<kormat>
|
nods solemnly
|
2021-02-15 11:29:57
|
<wikibugs>
|
('CR) ''Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 11:30:04
|
<jouncebot>
|
jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1130).
|
2021-02-15 11:32:31
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: cloudgw: move common hiera into proper file [puppet] - ''https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 11:33:13
|
<wikibugs>
|
('CR) ''Jbond: "See comments inline, also wonder if you considered using pathlib for the file operations." (''5 comments) [puppet] - ''https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: ''CRusnov)'
|
2021-02-15 11:33:17
|
<wikibugs>
|
('PS4) ''Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - ''https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115)'
|
2021-02-15 11:33:19
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] cloudgw: move common hiera into proper file [puppet] - ''https://gerrit.wikimedia.org/r/664241 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 11:34:50
|
<wikibugs>
|
('PS5) ''Arturo Borrero Gonzalez: cloud: drop NAT exceptions for dumps NFS [puppet] - ''https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397)'
|
2021-02-15 11:37:34
|
<moritzm>
|
!log reimaging bast5001 to buster
|
2021-02-15 11:45:23
|
<wikibugs>
|
('CR) ''Jbond: "Adding Andrew to approve privatedata-users access" [puppet] - ''https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304) (owner: ''Vgutierrez)'
|
2021-02-15 11:52:45
|
<logmsgbot>
|
!log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1007.eqiad.wmnet
|
2021-02-15 11:54:09
|
<wikibugs>
|
('CR) ''Jbond: "see comments" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/663993 (owner: ''Urbanecm)'
|
2021-02-15 11:55:13
|
<wikibugs>
|
('CR) ''Urbanecm: Update urbanecm's dotfiles (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/663993 (owner: ''Urbanecm)'
|
2021-02-15 11:55:23
|
<wikibugs>
|
('PS2) ''Urbanecm: Update urbanecm's dotfiles [puppet] - ''https://gerrit.wikimedia.org/r/663993'
|
2021-02-15 11:56:00
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] Update urbanecm's dotfiles [puppet] - ''https://gerrit.wikimedia.org/r/663993 (owner: ''Urbanecm)'
|
2021-02-15 11:56:21
|
<jbond42>
|
Urbanecm: ^^ merged
|
2021-02-15 11:56:24
|
<Urbanecm>
|
thanks jbond42 !
|
2021-02-15 11:56:28
|
<jbond42>
|
:) np
|
2021-02-15 11:58:52
|
<logmsgbot>
|
!log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1007.eqiad.wmnet
|
2021-02-15 12:00:05
|
<jouncebot>
|
Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1200).
|
2021-02-15 12:00:05
|
<jouncebot>
|
No GERRIT patches in the queue for this window AFAICS.
|
2021-02-15 12:00:14
|
<Urbanecm>
|
I'll deploy regardless
|
2021-02-15 12:01:12
|
<wikibugs>
|
('CR) ''Urbanecm: [C: ''+2] Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: ''Urbanecm)'
|
2021-02-15 12:02:46
|
<wikibugs>
|
('Merged) ''jenkins-bot: Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) (owner: ''Urbanecm)'
|
2021-02-15 12:04:02
|
<wikibugs>
|
('PS1) ''Effie Mouzeli: hiera: install memcached 1.6 on mc1037 [puppet] - ''https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315)'
|
2021-02-15 12:06:36
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+1] "thanks this will also be a big help to me 😊" [puppet] - ''https://gerrit.wikimedia.org/r/664237 (owner: ''Jcrespo)'
|
2021-02-15 12:07:47
|
<logmsgbot>
|
!log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE
|
2021-02-15 12:07:55
|
<wikibugs>
|
('PS22) ''Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - ''https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)'
|
2021-02-15 12:08:54
|
<Urbanecm>
|
can someone check mwdebug1002.eqiad.wmnet status, and remove it from scap if it is still broken (as mutante said in ops list)?
|
2021-02-15 12:09:16
|
<wikibugs>
|
('CR) ''Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/28065/mc2037.codfw.wmnet/index.html" [puppet] - ''https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: ''Effie Mouzeli)'
|
2021-02-15 12:09:47
|
<logmsgbot>
|
!log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast5001.wikimedia.org with reason: REIMAGE
|
2021-02-15 12:09:59
|
<wikibugs>
|
('PS2) ''Muehlenhoff: Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - ''https://gerrit.wikimedia.org/r/662918'
|
2021-02-15 12:10:35
|
<logmsgbot>
|
!log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 662d5f6af01f6cf6ce7e9d56cf1bc3ba282afee1: Revert "Revert "Enable SandboxLink at viwiki"" (T272796) (duration: 05m 26s)
|
2021-02-15 12:10:41
|
<Urbanecm>
|
finally
|
2021-02-15 12:11:36
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (''MoritzMuehlenhoff) Also needs approval by @Ottomata for Hadoop access.'
|
2021-02-15 12:13:39
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] linkrecommendation: Cron job to load datasets [deployment-charts] - ''https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 12:14:25
|
<wikibugs>
|
('CR) ''Kosta Harlan: [C: ''+2] linkrecommendation: Cron job to load datasets [deployment-charts] - ''https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 12:15:59
|
<wikibugs>
|
('Merged) ''jenkins-bot: linkrecommendation: Cron job to load datasets [deployment-charts] - ''https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 12:16:19
|
<wikibugs>
|
('CR) ''Vgutierrez: [C: ''+1] delete class tlsproxy::prometheus and nginx template [puppet] - ''https://gerrit.wikimedia.org/r/659377 (https://phabricator.wikimedia.org/T272559) (owner: ''Dzahn)'
|
2021-02-15 12:16:21
|
<wikibugs>
|
('PS2) ''Urbanecm: ukwikisource: Finish removal of NS Translations [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628)'
|
2021-02-15 12:16:24
|
<wikibugs>
|
('CR) ''Urbanecm: [C: ''+2] ukwikisource: Finish removal of NS Translations [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: ''Urbanecm)'
|
2021-02-15 12:17:21
|
<wikibugs>
|
('Merged) ''jenkins-bot: ukwikisource: Finish removal of NS Translations [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628) (owner: ''Urbanecm)'
|
2021-02-15 12:17:27
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] "left a nit for the commit msg, LGTM otherwise!" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: ''Effie Mouzeli)'
|
2021-02-15 12:18:18
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] "Effie can you run a pcc to see if everything looks good?" [puppet] - ''https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: ''Effie Mouzeli)'
|
2021-02-15 12:18:47
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
|
2021-02-15 12:21:30
|
<Urbanecm>
|
repeating myself: can someone depool mwdebug1002? it's currently down (see mail from dzahn in ops list), but still pooled and thus in scap dsh group :/
|
2021-02-15 12:22:25
|
<wikibugs>
|
'SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 12:23:45
|
<wikibugs>
|
'SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (''MoritzMuehlenhoff) Adding a few tags for affected sub teams, simply untag when completed'
|
2021-02-15 12:24:38
|
<wikibugs>
|
'SRE, ''Analytics, ''observability, ''serviceops, ''cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (''MoritzMuehlenhoff)'
|
2021-02-15 12:25:33
|
<wikibugs>
|
('CR) ''Volans: "quick direct reply, will have a pass later" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: ''CRusnov)'
|
2021-02-15 12:25:55
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: "Thanks for the review!" (''6 comments) [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 12:30:51
|
<wikibugs>
|
('PS1) ''JMeybohm: admin: Allow tiller to create batch ressources [deployment-charts] - ''https://gerrit.wikimedia.org/r/664273'
|
2021-02-15 12:32:02
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+2 C: ''+2] admin: Allow tiller to create batch ressources [deployment-charts] - ''https://gerrit.wikimedia.org/r/664273 (owner: ''JMeybohm)'
|
2021-02-15 12:32:29
|
<logmsgbot>
|
!log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet
|
2021-02-15 12:33:32
|
<wikibugs>
|
('Merged) ''jenkins-bot: admin: Allow tiller to create batch ressources [deployment-charts] - ''https://gerrit.wikimedia.org/r/664273 (owner: ''JMeybohm)'
|
2021-02-15 12:35:00
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
|
2021-02-15 12:35:39
|
<logmsgbot>
|
!log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cdf15981f7c6f7e02a3fb1c1ce61dc14815f216d: ukwikisource: Finish removal of NS Translations (T270628) (duration: 01m 07s)
|
2021-02-15 12:36:24
|
<wikibugs>
|
('PS1) ''Elukey: Add/Fix kerberos fake keytabs [labs/private] - ''https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392)'
|
2021-02-15 12:36:46
|
<wikibugs>
|
('CR) ''Elukey: [V: ''+2 C: ''+2] Add/Fix kerberos fake keytabs [labs/private] - ''https://gerrit.wikimedia.org/r/664274 (https://phabricator.wikimedia.org/T274392) (owner: ''Elukey)'
|
2021-02-15 12:37:06
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes - elukey@cumin1001
|
2021-02-15 12:37:32
|
<wikibugs>
|
'SRE, ''ops-eqiad, ''DC-Ops, ''Wikidata, and 2 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 12:38:28
|
<wikibugs>
|
('CR) ''David Caro: [C: ''+1] cloudgw: introduce HA by using keepalived/VRRP (''6 comments) [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 12:38:36
|
<wikibugs>
|
('PS9) ''Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 12:38:38
|
<wikibugs>
|
'SRE, ''observability, ''serviceops, ''Patch-For-Review, ''cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (''elukey)'
|
2021-02-15 12:39:18
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
|
2021-02-15 12:40:12
|
<wikibugs>
|
('PS10) ''Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 12:43:59
|
<moritzm>
|
!log reimaging bast4002 to buster
|
2021-02-15 12:44:04
|
<wikibugs>
|
('PS11) ''Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 12:44:09
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
|
2021-02-15 12:44:39
|
<icinga-wm>
|
PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
|
2021-02-15 12:44:59
|
<icinga-wm>
|
PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
|
2021-02-15 12:45:53
|
<icinga-wm>
|
PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
|
2021-02-15 12:46:25
|
<icinga-wm>
|
RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
|
2021-02-15 12:47:24
|
<wikibugs>
|
('PS12) ''Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 12:47:35
|
<icinga-wm>
|
RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9184 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
|
2021-02-15 12:47:58
|
<wikibugs>
|
('CR) ''Effie Mouzeli: "> Patch Set 1: Code-Review+1" [puppet] - ''https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: ''Effie Mouzeli)'
|
2021-02-15 12:48:27
|
<icinga-wm>
|
RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
|
2021-02-15 12:49:10
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/28075/" [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 12:49:13
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [V: ''+2 C: ''+2] cloudgw: introduce HA by using keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 12:49:45
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 T273955', diff saved to https://phabricator.wikimedia.org/P14359 and previous config saved to /var/cache/conftool/dbconfig/20210215-124944-marostegui.json
|
2021-02-15 12:50:24
|
<wikibugs>
|
('PS2) ''David Caro: utils: add script to run docker ci tests locally [software/spicerack] - ''https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338)'
|
2021-02-15 12:50:27
|
<wikibugs>
|
('PS1) ''Marostegui: db1093: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955)'
|
2021-02-15 12:50:50
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
|
2021-02-15 12:51:16
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] db1093: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/664276 (https://phabricator.wikimedia.org/T273955) (owner: ''Marostegui)'
|
2021-02-15 12:58:16
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
|
2021-02-15 12:58:16
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
|
2021-02-15 12:58:19
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 12:58:22
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:01:10
|
<Lucas_WMDE>
|
we lost a whole bunch of SAL messages because stashbot was out
|
2021-02-15 13:01:12
|
<logmsgbot>
|
!log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
|
2021-02-15 13:01:15
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:01:21
|
<Lucas_WMDE>
|
is it worth repeating them all?
|
2021-02-15 13:01:49
|
<Lucas_WMDE>
|
cc marostegui, ryankemper, ariel, elukey…
|
2021-02-15 13:02:04
|
<marostegui>
|
Lucas_WMDE: not from my side, thanks though! :)
|
2021-02-15 13:02:10
|
<Lucas_WMDE>
|
ok
|
2021-02-15 13:02:26
|
<Lucas_WMDE>
|
sometimes I do it but this seems to be almost 50 missed messages and I’m lazy :D
|
2021-02-15 13:02:41
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
|
2021-02-15 13:02:41
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
|
2021-02-15 13:02:44
|
<Lucas_WMDE>
|
(they’re all in the IRC log)
|
2021-02-15 13:02:44
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:02:48
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:02:50
|
<wikibugs>
|
'SRE, ''DBA, ''Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (''Marostegui) db1162 is fully pooled'
|
2021-02-15 13:03:18
|
<logmsgbot>
|
!log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
|
2021-02-15 13:03:21
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:05:58
|
<Lucas_WMDE>
|
!log notice: stashbot had issues between 8:19 and 12:50, see for https://wm-bot.wmflabs.org/browser/index.php?start=02%2F15%2F2021&end=02%2F15%2F2021&display=%23wikimedia-operations for missed !log messages
|
2021-02-15 13:06:01
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:06:54
|
<godog>
|
!log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
|
2021-02-15 13:06:57
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:06:58
|
<stashbot>
|
T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
|
2021-02-15 13:14:05
|
<wikibugs>
|
('PS1) ''JMeybohm: linkrecommendation: Read DB_USER from public config [deployment-charts] - ''https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893)'
|
2021-02-15 13:14:16
|
<jayme>
|
^ kostajh
|
2021-02-15 13:14:58
|
<wikibugs>
|
('CR) ''Kosta Harlan: [C: ''+2] linkrecommendation: Read DB_USER from public config [deployment-charts] - ''https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: ''JMeybohm)'
|
2021-02-15 13:15:30
|
<kostajh>
|
jayme: cheers
|
2021-02-15 13:17:35
|
<wikibugs>
|
('Merged) ''jenkins-bot: linkrecommendation: Read DB_USER from public config [deployment-charts] - ''https://gerrit.wikimedia.org/r/664277 (https://phabricator.wikimedia.org/T265893) (owner: ''JMeybohm)'
|
2021-02-15 13:19:28
|
<logmsgbot>
|
!log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
|
2021-02-15 13:19:36
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:21:47
|
<wikibugs>
|
('PS4) ''Hnowlan: mtail: create separate metrics histogram based on endpoint [puppet] - ''https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727)'
|
2021-02-15 13:22:04
|
<wikibugs>
|
('CR) ''Hnowlan: [V: ''+2 C: ''+2] tegola: Add docker image. [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: ''Hnowlan)'
|
2021-02-15 13:28:57
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: "Shouldn't this instead be done via the pipeline? It would greatly decouple upgrading tegola from requiring an SRE to build newer versions " [docker-images/production-images] - ''https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: ''Hnowlan)'
|
2021-02-15 13:33:36
|
<marostegui>
|
!log Stop MySQL on db1093 - T273955
|
2021-02-15 13:33:39
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:33:41
|
<stashbot>
|
T273955: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955
|
2021-02-15 13:34:02
|
<wikibugs>
|
('PS5) ''Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 13:34:39
|
<wikibugs>
|
('CR) ''jerkins-bot: [V: ''-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 13:38:10
|
<moritzm>
|
!log installing subversion security updates
|
2021-02-15 13:38:14
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:41:38
|
<wikibugs>
|
('PS6) ''Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 13:43:11
|
<icinga-wm>
|
RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 13:47:55
|
<wikibugs>
|
('PS2) ''Muehlenhoff: admin: Add christinedk user [puppet] - ''https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: ''Vgutierrez)'
|
2021-02-15 13:48:03
|
<wikibugs>
|
('PS7) ''Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 13:48:13
|
<wikibugs>
|
('CR) ''jerkins-bot: [V: ''-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 13:53:00
|
<logmsgbot>
|
!log gehel@cumin2001 START - Cookbook sre.wdqs.data-reload
|
2021-02-15 13:53:03
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 13:57:13
|
<moritzm>
|
!log installing libonig security update for stretch
|
2021-02-15 13:57:16
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 14:08:09
|
<godog>
|
!log swift eqiad-prod: add weight back to sdg on ms-be1054 - T273582
|
2021-02-15 14:08:14
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 14:08:15
|
<stashbot>
|
T273582: Put sdg1 on ms-be1054 back in service - https://phabricator.wikimedia.org/T273582
|
2021-02-15 14:10:43
|
<wikibugs>
|
'SRE, ''SRE-swift-storage, ''Patch-For-Review, ''User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (''fgiunchedi) ''Open→''Resolved I'm boldly resolving this again since limiting memory usage for object replication processes helped a whole lot to...'
|
2021-02-15 14:12:42
|
<wikibugs>
|
('PS1) ''Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789)'
|
2021-02-15 14:13:04
|
<Urbanecm>
|
jouncebot: now
|
2021-02-15 14:13:05
|
<jouncebot>
|
No deployments scheduled for the next 3 hour(s) and 46 minute(s)
|
2021-02-15 14:13:15
|
<wikibugs>
|
('PS2) ''Urbanecm: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789)'
|
2021-02-15 14:13:18
|
<wikibugs>
|
('CR) ''Urbanecm: [C: ''+2] Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: ''Urbanecm)'
|
2021-02-15 14:14:07
|
<wikibugs>
|
('Merged) ''jenkins-bot: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - ''https://gerrit.wikimedia.org/r/664294 (https://phabricator.wikimedia.org/T274789) (owner: ''Urbanecm)'
|
2021-02-15 14:17:02
|
<logmsgbot>
|
!log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 00905c4a7e4bb69f39e52e1c4d4d6168006b0e7b: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T274789) (duration: 01m 09s)
|
2021-02-15 14:17:06
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 14:17:07
|
<stashbot>
|
T274789: Add <https://static.president.az/> to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T274789
|
2021-02-15 14:19:43
|
<icinga-wm>
|
PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 14:23:44
|
<wikibugs>
|
('PS8) ''Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 14:25:33
|
<icinga-wm>
|
RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 14:28:37
|
<wikibugs>
|
('CR) ''David Caro: utils: add script to run docker ci tests locally (''3 comments) [software/spicerack] - ''https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: ''David Caro)'
|
2021-02-15 14:31:40
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - ''https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: ''Ladsgroup)'
|
2021-02-15 14:34:23
|
<wikibugs>
|
'SRE, ''Maps, ''Product-Infrastructure-Team-Backlog, ''Services, ''Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 14:34:33
|
<wikibugs>
|
'SRE, ''Maps, ''Product-Infrastructure-Team-Backlog, ''Services, ''Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
2021-02-15 14:45:09
|
<icinga-wm>
|
RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 14:48:25
|
<wikibugs>
|
('PS1) ''Jbond: Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - ''https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953)'
|
2021-02-15 15:04:50
|
<godog>
|
!log upgrade grafana to 7.4.1 on grafana1002 - T263747
|
2021-02-15 15:04:54
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:04:55
|
<stashbot>
|
T263747: Upgrade Grafana to 7.4 - https://phabricator.wikimedia.org/T263747
|
2021-02-15 15:06:15
|
<wikibugs>
|
('CR) ''Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 15:06:27
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (''MoritzMuehlenhoff) Also adding @Ottomata for approval for analytics-privatedata-users.'
|
2021-02-15 15:09:46
|
<moritzm>
|
!log reimaging bast3004 to buster
|
2021-02-15 15:09:49
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:15:06
|
<wikibugs>
|
('PS1) ''Bartosz Dziewoński: CommentFormatter: Fix problems with editsection and quotes [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - ''https://gerrit.wikimedia.org/r/664254 (https://phabricator.wikimedia.org/T274709)'
|
2021-02-15 15:17:18
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
|
2021-02-15 15:17:21
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:17:21
|
<wikibugs>
|
('CR) ''Jbond: "did a quick pass however im not that familiar with the current decom cook book" (''7 comments) [cookbooks] - ''https://gerrit.wikimedia.org/r/663878 (owner: ''Elukey)'
|
2021-02-15 15:20:05
|
<wikibugs>
|
('PS1) ''Kormat: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - ''https://gerrit.wikimedia.org/r/664300'
|
2021-02-15 15:20:10
|
<icinga-wm>
|
PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
|
2021-02-15 15:27:56
|
<wikibugs>
|
('CR) ''Hashar: [C: ''+1] "Can be merged anytime, the CI job always does a gem update :]" [puppet] - ''https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: ''Jbond)'
|
2021-02-15 15:28:49
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] Gemfile: increase dependency for wmf_style-stylegude-check [puppet] - ''https://gerrit.wikimedia.org/r/664297 (https://phabricator.wikimedia.org/T209953) (owner: ''Jbond)'
|
2021-02-15 15:30:19
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
|
2021-02-15 15:30:23
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:31:21
|
<icinga-wm>
|
RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is OK: HTTP OK: HTTP/1.0 200 OK - 23547 bytes in 0.829 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
|
2021-02-15 15:33:08
|
<moritzm>
|
!log installing linux-4.19 update for Stretch on servers which have it installed (no reboots, just updating the kernels)
|
2021-02-15 15:33:12
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:33:35
|
<wikibugs>
|
('CR) ''Kormat: [C: ''+2] integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - ''https://gerrit.wikimedia.org/r/664300 (owner: ''Kormat)'
|
2021-02-15 15:34:16
|
<logmsgbot>
|
!log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
|
2021-02-15 15:34:20
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:34:30
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+2] Preventive commit for jynus to misspell "bullseye", next Debian version [puppet] - ''https://gerrit.wikimedia.org/r/664237 (owner: ''Jcrespo)'
|
2021-02-15 15:36:11
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
|
2021-02-15 15:36:12
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
|
2021-02-15 15:36:12
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
|
2021-02-15 15:36:15
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:36:18
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:36:20
|
<logmsgbot>
|
!log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
|
2021-02-15 15:36:22
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:36:25
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:36:46
|
<wikibugs>
|
('PS1) ''Jcrespo: testing test at test at testing [puppet] - ''https://gerrit.wikimedia.org/r/664301'
|
2021-02-15 15:36:54
|
<wikibugs>
|
('Merged) ''jenkins-bot: integration_env: Rework cli to simplify operations [software/wmfmariadbpy] - ''https://gerrit.wikimedia.org/r/664300 (owner: ''Kormat)'
|
2021-02-15 15:38:02
|
<wikibugs>
|
('CR) ''jerkins-bot: [V: ''-1] testing test at test at testing [puppet] - ''https://gerrit.wikimedia.org/r/664301 (owner: ''Jcrespo)'
|
2021-02-15 15:38:36
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
|
2021-02-15 15:38:39
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:38:49
|
<wikibugs>
|
('CR) ''Jcrespo: "16:37:55 Typo found!" [puppet] - ''https://gerrit.wikimedia.org/r/664301 (owner: ''Jcrespo)'
|
2021-02-15 15:39:13
|
<wikibugs>
|
('Abandoned) ''Jcrespo: testing test at test at testing [puppet] - ''https://gerrit.wikimedia.org/r/664301 (owner: ''Jcrespo)'
|
2021-02-15 15:39:46
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: [C: ''-1] "1 pedantic comment but perhaps we can solve this more easily, see inline." (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/659863 (owner: ''JMeybohm)'
|
2021-02-15 15:39:52
|
<wikibugs>
|
'SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (''fgiunchedi)'
|
2021-02-15 15:40:39
|
<wikibugs>
|
('PS1) ''Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - ''https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795)'
|
2021-02-15 15:41:13
|
<wikibugs>
|
'SRE: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (''fgiunchedi)'
|
2021-02-15 15:44:52
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: "+1, but perhaps we don't even need it? See dependent commit" [deployment-charts] - ''https://gerrit.wikimedia.org/r/659864 (owner: ''JMeybohm)'
|
2021-02-15 15:45:07
|
<wikibugs>
|
('PS1) ''Muehlenhoff: Add a comment to the snapshot block [puppet] - ''https://gerrit.wikimedia.org/r/664303'
|
2021-02-15 15:46:19
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
|
2021-02-15 15:46:21
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:46:44
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - ''https://gerrit.wikimedia.org/r/664255'
|
2021-02-15 15:46:53
|
<wikibugs>
|
('PS2) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - ''https://gerrit.wikimedia.org/r/664255'
|
2021-02-15 15:47:25
|
<wikibugs>
|
('PS3) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - ''https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 15:47:45
|
<wikibugs>
|
('PS2) ''Elukey: hadoop: update the HDFS Namenode rack configuration [puppet] - ''https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795)'
|
2021-02-15 15:48:09
|
<logmsgbot>
|
!log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
|
2021-02-15 15:48:12
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:48:54
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
|
2021-02-15 15:48:56
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:49:03
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] Revert "cloud: hiera: add vlan 2120 back into the neutron bridge" [puppet] - ''https://gerrit.wikimedia.org/r/664255 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 15:50:13
|
<wikibugs>
|
('PS1) ''Muehlenhoff: Remove obsolete cloudera config from reprepro [puppet] - ''https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797)'
|
2021-02-15 15:50:56
|
<wikibugs>
|
('CR) ''Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 15:51:26
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
|
2021-02-15 15:51:26
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - ''https://gerrit.wikimedia.org/r/664256'
|
2021-02-15 15:51:29
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:51:39
|
<wikibugs>
|
('PS2) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - ''https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 15:51:47
|
<wikibugs>
|
('PS3) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - ''https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 15:52:15
|
<wikibugs>
|
'SRE, ''Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (''fgiunchedi) Note that the elastic 5 "not found" errors seem flappy, I just got a `checkupdate` run without those errors'
|
2021-02-15 15:53:19
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - ''https://gerrit.wikimedia.org/r/664257'
|
2021-02-15 15:53:26
|
<wikibugs>
|
('PS2) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - ''https://gerrit.wikimedia.org/r/664257'
|
2021-02-15 15:53:34
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
|
2021-02-15 15:53:36
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:53:37
|
<wikibugs>
|
('PS3) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - ''https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 15:53:49
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] Revert "cloud: hiera: connect cloudnet servers back to vlan 2120" [puppet] - ''https://gerrit.wikimedia.org/r/664256 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 15:53:57
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] Add a comment to the snapshot block [puppet] - ''https://gerrit.wikimedia.org/r/664303 (owner: ''Muehlenhoff)'
|
2021-02-15 15:57:26
|
<wikibugs>
|
('PS4) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" This reverts commit 5ca98c9df08f6c6e2d97bc7b6279cdaf573eddce. Reason for revert: rebuilding the cloudgw setup Bug: T272963 Change-Id: I8185f4fa36a70255940d78db45b0f50cfc6abb98 Signed-off-by: Arturo Borrero Gonzalez <aborrero@wikimedia.org> [puppet] - ''https://gerrit.wikimedia.org/r/664257 (https://phabricator.wi'
|
2021-02-15 15:58:00
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
|
2021-02-15 15:58:03
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 15:58:12
|
<wikibugs>
|
('PS5) ''Arturo Borrero Gonzalez: Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - ''https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 15:58:20
|
<wikibugs>
|
'SRE, ''SRE-tools, ''User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (''jijiki)'
|
2021-02-15 16:02:38
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] Revert "cloud: hiera: enable back neutron hacks in codfw1dev" [puppet] - ''https://gerrit.wikimedia.org/r/664257 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 16:04:06
|
<wikibugs>
|
('CR) ''Volans: "Thanks for the refactor, some comments inline, some already discussed over IRC." (''14 comments) [software/spicerack] - ''https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: ''David Caro)'
|
2021-02-15 16:04:51
|
<wikibugs>
|
'SRE, ''ops-eqiad, ''DC-Ops, ''Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (''Gehel)'
|
2021-02-15 16:05:18
|
<logmsgbot>
|
!log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudnet2003-dev.codfw.wmnet
|
2021-02-15 16:05:21
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:05:56
|
<wikibugs>
|
'SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (''MoritzMuehlenhoff)'
|
2021-02-15 16:07:37
|
<logmsgbot>
|
!log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet
|
2021-02-15 16:07:40
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:09:49
|
<logmsgbot>
|
!log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2003-dev.codfw.wmnet
|
2021-02-15 16:09:52
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:10:23
|
<icinga-wm>
|
PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
|
2021-02-15 16:11:29
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 16:11:55
|
<icinga-wm>
|
RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
|
2021-02-15 16:12:12
|
<logmsgbot>
|
!log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet
|
2021-02-15 16:12:16
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:12:57
|
<logmsgbot>
|
!log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet
|
2021-02-15 16:13:01
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:14:35
|
<hoo>
|
!log Updated the Wikidata property suggester with data from the 2021-02-01 JSON dump (with pre-applied T132839 workarounds)
|
2021-02-15 16:14:38
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:14:40
|
<stashbot>
|
T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
|
2021-02-15 16:16:34
|
<wikibugs>
|
('PS2) ''Arturo Borrero Gonzalez: cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 16:18:20
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP [puppet] - ''https://gerrit.wikimedia.org/r/664307 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 16:18:35
|
<logmsgbot>
|
!log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet
|
2021-02-15 16:18:39
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:22:08
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C: ''+2] Add a comment to the snapshot block [puppet] - ''https://gerrit.wikimedia.org/r/664303 (owner: ''Muehlenhoff)'
|
2021-02-15 16:22:14
|
<logmsgbot>
|
!log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet
|
2021-02-15 16:22:17
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:24:53
|
<wikibugs>
|
'SRE: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (''Volans) p:''Triage→''High a:''Volans'
|
2021-02-15 16:25:11
|
<wikibugs>
|
('PS1) ''Volans: interface automation: fix typo in method name [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802)'
|
2021-02-15 16:26:03
|
<jayme>
|
!log rolled back linkrecommendation helm releases to the most recent revision running chart version linkrecommendation-0.0.4 on clusters codfw and eqiad (cc: kostajh)
|
2021-02-15 16:26:05
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:27:09
|
<logmsgbot>
|
!log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1001.eqiad.wmnet
|
2021-02-15 16:27:13
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:28:09
|
<wikibugs>
|
('CR) ''Volans: [C: ''+2] "self merging as it's just a typo, will run the script against bast3004 manually to verify it" [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664308 (https://phabricator.wikimedia.org/T274802) (owner: ''Volans)'
|
2021-02-15 16:32:38
|
<logmsgbot>
|
!log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1001.eqiad.wmnet
|
2021-02-15 16:32:43
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:33:48
|
<volans>
|
!log restarted netbox on netbox1001
|
2021-02-15 16:33:51
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:36:18
|
<icinga-wm>
|
PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 16:36:42
|
<wikibugs>
|
('PS1) ''Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802)'
|
2021-02-15 16:37:12
|
<volans>
|
mmmh icinga, are you sure? it's all good there, it was me and was already fixed
|
2021-02-15 16:37:20
|
<wikibugs>
|
('CR) ''jerkins-bot: [V: ''-1] interface automation: fix typo in method name (2) [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: ''Volans)'
|
2021-02-15 16:37:56
|
<wikibugs>
|
('PS2) ''Volans: interface automation: fix typo in method name (2) [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802)'
|
2021-02-15 16:39:57
|
<logmsgbot>
|
!log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1002.eqiad.wmnet
|
2021-02-15 16:40:00
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:40:06
|
<wikibugs>
|
('CR) ''Volans: [C: ''+2] "Typo fix." [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/664309 (https://phabricator.wikimedia.org/T274802) (owner: ''Volans)'
|
2021-02-15 16:40:14
|
<wikibugs>
|
('PS1) ''Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893)'
|
2021-02-15 16:40:14
|
<icinga-wm>
|
PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
|
2021-02-15 16:40:45
|
<jayme>
|
^ that's "expected" (kind of) from reboots
|
2021-02-15 16:41:29
|
<wikibugs>
|
('CR) ''jerkins-bot: [V: ''-1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 16:41:40
|
<icinga-wm>
|
RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
|
2021-02-15 16:43:00
|
<wikibugs>
|
('PS2) ''Kosta Harlan: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893)'
|
2021-02-15 16:43:18
|
<icinga-wm>
|
RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 16:44:44
|
<wikibugs>
|
'SRE, ''CAS-SSO, ''Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (''Gehel) Removing discovery-search, if you need our help again, please ping us!'
|
2021-02-15 16:46:44
|
<logmsgbot>
|
!log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1002.eqiad.wmnet
|
2021-02-15 16:46:49
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2021-02-15 16:48:30
|
<icinga-wm>
|
PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 16:48:50
|
<wikibugs>
|
'SRE, ''Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (''Volans) a:''Volans→''crusnov @crusnov passing it over to you. I've fixed the basic typos, but the problem now is that the scri...'
|
2021-02-15 16:49:43
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: cloudgw: switch data place interface config modes to manual [puppet] - ''https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 16:49:51
|
<wikibugs>
|
'SRE, ''Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (''crusnov) That seems reasonable, I'll look at it and get a patch out soonish.'
|
2021-02-15 16:52:45
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] cloudgw: switch data place interface config modes to manual [puppet] - ''https://gerrit.wikimedia.org/r/664311 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 16:53:09
|
<icinga-wm>
|
PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
|
2021-02-15 16:57:37
|
<icinga-wm>
|
RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
|
2021-02-15 17:00:58
|
<wikibugs>
|
'SRE, ''Maps, ''Product-Infrastructure-Team-Backlog, ''Services, ''Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (''akosiaris) Thanks for this task! So I've studied the diagrams a bit, they are helpful. The deployment pipeline definitely suppor...'
|
2021-02-15 17:03:18
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] "Just to confirm - this will keep the cloudera components but clear all the pull-specific bits. If so, big +1, thanks :)" [puppet] - ''https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: ''Muehlenhoff)'
|
2021-02-15 17:16:13
|
<wikibugs>
|
('CR) ''Elukey: "John thanks a lot for the review! For this particular use case, I'd prefer to just move the existing code base to the class api and then m" [cookbooks] - ''https://gerrit.wikimedia.org/r/663878 (owner: ''Elukey)'
|
2021-02-15 17:27:06
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+2] hadoop: update the HDFS Namenode rack configuration [puppet] - ''https://gerrit.wikimedia.org/r/664302 (https://phabricator.wikimedia.org/T274795) (owner: ''Elukey)'
|
2021-02-15 17:28:16
|
<wikibugs>
|
('PS1) ''Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - ''https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)'
|
2021-02-15 17:28:18
|
<wikibugs>
|
('PS1) ''Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - ''https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182)'
|
2021-02-15 17:29:54
|
<wikibugs>
|
('Abandoned) ''Jcrespo: jessie: Remove old openssl override after revert to package version [puppet] - ''https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) (owner: ''Jcrespo)'
|
2021-02-15 17:30:04
|
<wikibugs>
|
('CR) ''Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 17:32:07
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: ''Kosta Harlan)'
|
2021-02-15 17:32:43
|
<wikibugs>
|
('PS10) ''David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - ''https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412)'
|
2021-02-15 17:32:43
|
<icinga-wm>
|
RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2021-02-15 17:33:16
|
<wikibugs>
|
('CR) ''David Caro: "Done all the changes as requested" (''13 comments) [software/spicerack] - ''https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: ''David Caro)'
|
2021-02-15 17:39:15
|
<wikibugs>
|
('CR) ''Jcrespo: "Have you tested backups with the script on etcd3? I don't see anything, like a path, completely wrong, but I don't know enough about what " [puppet] - ''https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: ''Jcrespo)'
|
2021-02-15 17:41:17
|
<wikibugs>
|
'SRE, ''serviceops, ''Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (''jcrespo) I've sent: https://gerrit.wikimedia.org/r/c/operations/puppet/+/664313 Independently of the pace of upgrading, we should give some priority to generating fresh backups from the...'
|
2021-02-15 17:43:56
|
<wikibugs>
|
('PS2) ''Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - ''https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)'
|
2021-02-15 17:44:23
|
<wikibugs>
|
('PS3) ''Jcrespo: configcluster: Enable etcd v3 backups for stretch hosts [puppet] - ''https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573)'
|
2021-02-15 17:55:42
|
<wikibugs>
|
('PS1) ''Arturo Borrero Gonzalez: cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - ''https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963)'
|
2021-02-15 17:57:40
|
<wikibugs>
|
('CR) ''Arturo Borrero Gonzalez: [C: ''+2] cloudgw: interfaces: relax check on routing setup by using 'onlink' [puppet] - ''https://gerrit.wikimedia.org/r/664317 (https://phabricator.wikimedia.org/T272963) (owner: ''Arturo Borrero Gonzalez)'
|
2021-02-15 17:59:36
|
<wikibugs>
|
('CR) ''Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - ''https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: ''Muehlenhoff)'
|
2021-02-15 18:00:04
|
<jouncebot>
|
ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1800).
|
2021-02-15 18:05:14
|
<wikibugs>
|
('CR) ''Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 18:10:38
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+1] "> Patch Set 1:" [cookbooks] - ''https://gerrit.wikimedia.org/r/663878 (owner: ''Elukey)'
|
2021-02-15 18:14:52
|
<wikibugs>
|
'SRE, ''DBA, ''serviceops, ''Goal, ''Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (''jcrespo)'
|
2021-02-15 18:15:15
|
<wikibugs>
|
'SRE, ''Data-Persistence-Backup, ''Goal, ''Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (''jcrespo)'
|
2021-02-15 18:15:40
|
<wikibugs>
|
('CR) ''Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 18:15:41
|
<wikibugs>
|
'SRE, ''Data-Persistence-Backup, ''Goal, ''Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (''jcrespo) ''Open→''Resolved Regarding the last 2 points, we have, in a way, done the last point "parametrize better the jobdefaults i...'
|
2021-02-15 18:17:39
|
<icinga-wm>
|
PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
|
2021-02-15 18:28:38
|
<wikibugs>
|
('PS1) ''Effie Mouzeli: (WIP) mediawiki::alerts add alert when 20% of servers is saturated [puppet] - ''https://gerrit.wikimedia.org/r/664319 (https://phabricator.wikimedia.org/T267176)'
|
2021-02-15 18:33:52
|
<wikibugs>
|
('CR) ''Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: ''Hnowlan)'
|
2021-02-15 18:41:27
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
|
2021-02-15 18:41:47
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
|
2021-02-15 18:45:40
|
<jynus>
|
that looks like DPLA bot on commons
|
2021-02-15 18:46:29
|
<jynus>
|
I see no issues, but keep an eye out in case something degrades (thumbnail generation, codfw s4 replication, etc.)
|
2021-02-15 18:47:54
|
<jynus>
|
that's 10 1MB files per second
|
2021-02-15 18:48:16
|
<tabbycat>
|
jynus: swift is TimedMediaHandler or just the place where uploads are being stored?
|
2021-02-15 18:49:21
|
<jynus>
|
swift is our OpenStack Swift cluster, our backend storage for media and rendered stuff: https://wikitech.wikimedia.org/wiki/Swift
|
2021-02-15 18:49:59
|
<jynus>
|
the alert is just a warning about a high rate of uploads; that doesn't mean there is a problem, but it is an unusual state
|
2021-02-15 18:50:23
|
<jynus>
|
normally we worry when it is very low, because it means there is a problem with uploads
|
2021-02-15 19:00:04
|
<jouncebot>
|
RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T1900). Please do the needful.
|
2021-02-15 19:00:04
|
<jouncebot>
|
No GERRIT patches in the queue for this window AFAICS.
|
2021-02-15 19:00:56
|
<Urbanecm>
|
jynus: do we want to do T248177?
|
2021-02-15 19:00:56
|
<stashbot>
|
T248177: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177
|
2021-02-15 19:01:29
|
<Urbanecm>
|
(but 999 uploads per second is effectively no rate limit anyway :/ )
|
2021-02-15 19:02:09
|
<tabbycat>
|
999/s is o_O
|
2021-02-15 19:03:32
|
<tabbycat>
|
IIRC there is/was an UploadStash for large or batch uploads Urbanecm ?
|
2021-02-15 19:04:10
|
<Urbanecm>
|
there's still uploadstash, dunno if it helps with ratelimited uploads
|
2021-02-15 19:11:01
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
|
2021-02-15 19:21:03
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
|
2021-02-15 19:28:58
|
<wikibugs>
|
(CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
|
2021-02-15 19:31:51
|
<wikibugs>
|
(CR) CRusnov: "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: CRusnov)
|
2021-02-15 20:10:06
|
<wikibugs>
|
(PS1) Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350
|
2021-02-15 20:11:43
|
<wikibugs>
|
(CR) jerkins-bot: [V: -1] [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350 (owner: Ladsgroup)
|
2021-02-15 20:25:00
|
<wikibugs>
|
(Abandoned) Ladsgroup: [DNM] Test jenkins new rule on banning use of hiera() [puppet] - https://gerrit.wikimedia.org/r/664350 (owner: Ladsgroup)
|
2021-02-15 20:30:51
|
<wikibugs>
|
SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (leila) approved. Thank you for your support!
|
2021-02-15 20:46:21
|
<icinga-wm>
|
PROBLEM - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
|
2021-02-15 20:46:24
|
<icinga-wm>
|
ACKNOWLEDGEMENT - MegaRAID on an-worker1097 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T274819 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
|
2021-02-15 20:46:27
|
<wikibugs>
|
SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (ops-monitoring-bot)
|
2021-02-15 20:47:01
|
<wikibugs>
|
SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (Peachey88)
|
2021-02-15 21:00:04
|
<jouncebot>
|
chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2100).
|
2021-02-15 21:51:52
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
|
2021-02-15 21:52:04
|
<icinga-wm>
|
PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
|
2021-02-15 22:00:04
|
<jouncebot>
|
Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210215T2200).
|
2021-02-15 22:50:50
|
<wikibugs>
|
(CR) Volans: [C: +1] "Code looks good to me, please test it on netbox-next to be sure." (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: CRusnov)
|
2021-02-15 22:52:34
|
<icinga-wm>
|
PROBLEM - Device not healthy -SMART- on an-worker1097 is CRITICAL: cluster=analytics device=sat+megaraid,13 instance=an-worker1097 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops
|
2021-02-15 23:31:52
|
<wikibugs>
|
(CR) Gergő Tisza: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
|