[00:00:38] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:14] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 17 probes of 718 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:07:08] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:32] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:50] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:38] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 716 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:09:06] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 6 probes of 721 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:12:46] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:32] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:48] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:56] PROBLEM - Check systemd state on elastic1041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:02] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:16] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:22] RECOVERY - Check systemd state on elastic1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:26] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) Ok, It seems all 3 methods (keep on ignore list, ACK, don't ACK) are not ideal, so I am not sure how I should correctly handle it. I will just put it back on the ignore list then.
[00:25:39] (PS1) Dzahn: Revert "Revert "bacula: add people1003 job to monitoring ignorelist"" [puppet] - https://gerrit.wikimedia.org/r/684464
[00:27:22] (CR) Dzahn: [C: +2] Revert "Revert "bacula: add people1003 job to monitoring ignorelist"" [puppet] - https://gerrit.wikimedia.org/r/684464 (owner: Dzahn)
[00:28:54] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:36] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) - host back on ignore list - icinga alert cleared - removed ACK from icinga check This can wait, there is no urgency to it. I am also not doing anything special here, so maybe it's all bullseye related.
[00:33:48] PROBLEM - Mediawiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:35:05] SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (Dzahn) a:Dzahn
[00:35:22] SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (Dzahn) Thanks! will look at that (later)
[00:35:56] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:18] SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (Dzahn) Resolved→Open
[00:36:26] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:38] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:55] SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (Dzahn) p:High→Medium
[00:38:31] SRE, Mail, Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (Dzahn) Maybe send it to that special "list of list owners" mailing list?
[00:38:52] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:52] SRE, Mail, Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (Dzahn) p:Triage→Medium
[00:40:21] SRE, Wikimedia-Mailing-lists: Enforce a consistent policy for disabled/archived mailing lists - https://phabricator.wikimedia.org/T281778 (Dzahn) p:Triage→Medium
[00:41:02] SRE, Wikimedia-Mailing-lists: Enforce a consistent policy for disabled/archived mailing lists - https://phabricator.wikimedia.org/T281778 (Dzahn) When that script was written a long time ago it was the attempt to standardize what a disabled list is. Before that all of them were done slightly different....
[00:42:36] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:14] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:17] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Dzahn) pressemeldungen = "German Wikinews" I can ask on https://de.wikinews.org/wiki/Wikinews:Pressestammtisch
[00:48:27] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Dzahn) p:Triage→Medium
[00:51:56] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:36] RECOVERY - Mediawiki CirrusSearch update rate - eqiad on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:04:22] SRE, ops-codfw, netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (Papaul) The faulty switch was delivered to Juniper today.
[01:41:17] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[01:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:27] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[01:42:13] (PS1) Papaul: DHCP: Add phab2002 MACaddress [puppet] - https://gerrit.wikimedia.org/r/684599 (https://phabricator.wikimedia.org/T280544)
[01:43:20] (CR) Dzahn: [C: +1] DHCP: Add phab2002 MACaddress [puppet] - https://gerrit.wikimedia.org/r/684599 (https://phabricator.wikimedia.org/T280544) (owner: Papaul)
[01:46:08] (CR) Papaul: [C: +2] DHCP: Add phab2002 MACaddress [puppet] - https://gerrit.wikimedia.org/r/684599 (https://phabricator.wikimedia.org/T280544) (owner: Papaul)
[01:52:20] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` phab2002.codfw.wmnet ` The log can be found in `/var/log/wmf-a...
[01:59:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:01:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:07:57] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on phab2002.codfw.wmnet with reason: REIMAGE
[02:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:07] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.4 [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/684603
[02:08:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.4 [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/684603 (owner: TrainBranchBot)
[02:09:52] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on phab2002.codfw.wmnet with reason: REIMAGE
[02:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:14] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab2002.codfw.wmnet'] ` and were **ALL** successful.
[02:21:02] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (Papaul)
[02:21:51] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (Papaul) Open→Resolved @Dzahn all yours
[02:26:59] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (Papaul) a:Papaul
[02:30:38] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.4 [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/684603 (owner: TrainBranchBot)
[03:06:19] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:06:33] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:06:53] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 414 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:07:37] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:43] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:14:05] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:18:43] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.235 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:19:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:24:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:25:31] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:59] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:26:13] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:35:40] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:50] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[03:36:22] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563`
[03:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:09] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:38] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:49] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:17] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:03] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:11] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:59] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:11] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:31] PROBLEM - Check systemd state on elastic1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:09] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:06:40] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[04:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:48] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[04:07:03] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:59] SRE, Wikimedia-Mailing-lists: Enforce a consistent policy for disabled/archived mailing lists - https://phabricator.wikimedia.org/T281778 (Ladsgroup) +1 on my side
[04:10:07] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:07] SRE, Mail, Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (Ladsgroup) >>! In T281753#7055632, @Dzahn wrote: > Maybe send it to that special "list of list owners" mailing list? That mailing list has around 1k members (and i...
[04:15:09] 10SRE, 10Wikimedia-Mailing-lists: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (10Ladsgroup) That sounds good to me. We can definitely try it. [04:15:27] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:31] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:29:25] RECOVERY - Check systemd state on elastic1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:52] (03PS1) 10Marostegui: db1121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/684667 (https://phabricator.wikimedia.org/T280492) [04:40:46] (03CR) 10Marostegui: [C: 03+2] db1121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/684667 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [04:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15674 and previous config saved to /var/cache/conftool/dbconfig/20210504-044101-root.json [04:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15675 and previous config saved to /var/cache/conftool/dbconfig/20210504-045605-root.json [04:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:25] (03CR) 10Marostegui: [C: 03+2] realm.pp: Add discussiontools_subscription to private tables [puppet] - 10https://gerrit.wikimedia.org/r/683070 (https://phabricator.wikimedia.org/T263817) (owner: 10Bartosz 
Dziewoński) [05:07:58] !log Restart sanitarium hosts to pick up new filters T263817 [05:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:06] T263817: DBA review: conversation subscriptions - https://phabricator.wikimedia.org/T263817 [05:11:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15676 and previous config saved to /var/cache/conftool/dbconfig/20210504-051108-root.json [05:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:18] (03PS1) 10Marostegui: sanitarium_multiinstance.my.cnf: Add innodb_change_buffering [puppet] - 10https://gerrit.wikimedia.org/r/684672 (https://phabricator.wikimedia.org/T263443) [05:21:18] (03CR) 10Marostegui: [C: 03+2] sanitarium_multiinstance.my.cnf: Add innodb_change_buffering [puppet] - 10https://gerrit.wikimedia.org/r/684672 (https://phabricator.wikimedia.org/T263443) (owner: 10Marostegui) [05:23:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:24:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:26:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15677 and previous config saved to /var/cache/conftool/dbconfig/20210504-052612-root.json [05:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:50] (03PS1) 10Marostegui: db1118: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/684678 [05:30:37] (03CR) 10Marostegui: [C: 03+2] db1118: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/684678 (owner: 10Marostegui) [05:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Repool db1118', diff saved to https://phabricator.wikimedia.org/P15678 and previous config saved to /var/cache/conftool/dbconfig/20210504-053149-root.json [05:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:56] !log Deploy schema change on s6 codfw, lag will appear - T266486 T268392 T273360 [05:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:05] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [05:37:06] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [05:37:06] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [05:41:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:43:34] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 2976859 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [05:45:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158 to clone db1178 T275633', diff saved to https://phabricator.wikimedia.org/P15680 and previous config saved to /var/cache/conftool/dbconfig/20210504-054539-marostegui.json [05:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:48] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:45:58] !log Stop mysql on db1158 to clone db1178 [05:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: Repool db1118', diff saved to https://phabricator.wikimedia.org/P15682 and previous config saved to /var/cache/conftool/dbconfig/20210504-054653-root.json [05:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15683 and previous config saved to /var/cache/conftool/dbconfig/20210504-055020-root.json [05:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1167 to clone db1178 T275633', diff saved to https://phabricator.wikimedia.org/P15684 and previous config saved to /var/cache/conftool/dbconfig/20210504-055116-marostegui.json [05:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:24] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:55:37] (03PS1) 10Marostegui: mariadb: Productionize db1178 [puppet] - 10https://gerrit.wikimedia.org/r/684689 (https://phabricator.wikimedia.org/T275633) [05:56:14] (03PS1) 10Marostegui: db1167: Change its 
section [puppet] - 10https://gerrit.wikimedia.org/r/684690 [05:58:27] (03CR) 10Marostegui: [C: 03+2] db1167: Change its section [puppet] - 10https://gerrit.wikimedia.org/r/684690 (owner: 10Marostegui) [06:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Repool db1118', diff saved to https://phabricator.wikimedia.org/P15686 and previous config saved to /var/cache/conftool/dbconfig/20210504-060156-root.json [06:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15687 and previous config saved to /var/cache/conftool/dbconfig/20210504-060523-root.json [06:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Repool db1118', diff saved to https://phabricator.wikimedia.org/P15688 and previous config saved to /var/cache/conftool/dbconfig/20210504-061700-root.json [06:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15689 and previous config saved to /var/cache/conftool/dbconfig/20210504-062027-root.json [06:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15690 and previous config saved to /var/cache/conftool/dbconfig/20210504-063530-root.json [06:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repool db1158', diff saved to 
https://phabricator.wikimedia.org/P15691 and previous config saved to /var/cache/conftool/dbconfig/20210504-065034-root.json [06:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:07] (03PS4) 10JMeybohm: Remove unused profile::etcd and related classes [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) [07:00:35] (03CR) 10JMeybohm: Remove unused profile::etcd and related classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:02:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1178 [puppet] - 10https://gerrit.wikimedia.org/r/684689 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [07:04:15] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29364/console" [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:04:54] (03PS4) 10JMeybohm: Rename role configcluster_stretch to configcluster [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) [07:05:06] (03PS5) 10JMeybohm: Remove unused profile::etcd and related classes [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) [07:05:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161 and db1082 to change s5 sanitarium master T280492', diff saved to https://phabricator.wikimedia.org/P15692 and previous config saved to /var/cache/conftool/dbconfig/20210504-071146-marostegui.json [07:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:54] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 [07:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 10%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15693 and previous config saved to /var/cache/conftool/dbconfig/20210504-071623-root.json [07:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15694 and previous config saved to /var/cache/conftool/dbconfig/20210504-071632-root.json [07:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:45] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [07:18:59] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [07:22:32] (03CR) 10Elukey: [C: 03+1] "IPs:ports looks good, IIUC we are adding egress rules only to changeprop since eventgate/eventstreams already got the new IPs in a separat" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [07:23:51] 10SRE, 10Wikimedia-Mailing-lists: Find list owners for lists 
without them - https://phabricator.wikimedia.org/T281779 (Ladsgroup) So archiving a mailing list in mm3 is not that specified (or at least I didn't find anything, there's delete but that's not what we want here). ac-temp is on emergency moderation... [07:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15695 and previous config saved to /var/cache/conftool/dbconfig/20210504-073127-root.json [07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15696 and previous config saved to /var/cache/conftool/dbconfig/20210504-073135-root.json [07:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:43] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (jcrespo) These are references to people1003 on backups. There are no recent failures ` root@backup1001:~$ grep people1003.eqiad.wmnet /var/log/bacula/log.1 27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: S...
[07:46:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15697 and previous config saved to /var/cache/conftool/dbconfig/20210504-074632-root.json [07:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15698 and previous config saved to /var/cache/conftool/dbconfig/20210504-074639-root.json [07:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:00] RECOVERY - Disk space on backup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [08:02:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 from s1 vslow to get its tables checked and pool db1099:3311 instead T280492', diff saved to https://phabricator.wikimedia.org/P15699 and previous config saved to /var/cache/conftool/dbconfig/20210504-080206-marostegui.json [08:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15700 and previous config saved to /var/cache/conftool/dbconfig/20210504-080212-root.json [08:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15701 and previous config saved to /var/cache/conftool/dbconfig/20210504-080213-root.json [08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:15] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 [08:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:58] !log Check tables on db1106, lag will show up on s1 on wiki replicas (T280492) [08:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:20] (PS1) Marostegui: db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/684795 (https://phabricator.wikimedia.org/T280492) [08:05:04] (CR) Marostegui: [C: +2] db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/684795 (https://phabricator.wikimedia.org/T280492) (owner: Marostegui) [08:05:09] (CR) JMeybohm: [C: +2] Rename role configcluster_stretch to configcluster [puppet] - https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [08:05:15] (CR) JMeybohm: [C: +2] Remove unused profile::etcd and related classes [puppet] - https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [08:07:03] (PS3) Volans: sre.hosts.remove-downtime: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/681308 [08:17:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15702 and previous config saved to /var/cache/conftool/dbconfig/20210504-081716-root.json [08:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:37] (PS1) Kosta Harlan: GrowthExperiments: Rename control variant to 'control' [mediawiki-config] - https://gerrit.wikimedia.org/r/684799 (https://phabricator.wikimedia.org/T281727) [08:30:16] (PS2) Muehlenhoff: install_server: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671088 [08:31:25] (CR) Volans: "Final couple of nits inline and it's ready!"
(3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov) [08:33:52] (CR) Majavah: [C: -1] openstack: neutron: topology changes for cloudgw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [08:43:49] (PS1) JMeybohm: prometheus: Clean up absent file resource [puppet] - https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) [08:45:22] (CR) jerkins-bot: [V: -1] prometheus: Clean up absent file resource [puppet] - https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [08:46:04] (CR) Muehlenhoff: [C: +1] "Looks great, one comment inline, but feel free to ignore:-)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/684437 (owner: Jbond) [08:46:38] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [08:48:56] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [08:53:41] (CR) Marostegui: [C: +1] mediabackup: Initial setup for the media backup worker hosts [puppet] - https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo) [08:57:36] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (Aklapper) Could someone please answer Schlurcher's question?
Thanks in advance. [09:01:13] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.50 [software/spicerack] - https://gerrit.wikimedia.org/r/684809 [09:02:18] SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (MoritzMuehlenhoff) [09:02:44] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (jcrespo) Adding current SRE clinic duty person: @Dzahn My guess is someone from service ops should follow up, but for him to decide. [09:03:30] SRE, Wikimedia-General-or-Unknown, Wikimedia-SVG-rendering, Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (Aklapper) [09:04:15] (PS4) Jbond: C:gitlab::ssh: add new gilab::ssh class [puppet] - https://gerrit.wikimedia.org/r/684437 [09:04:17] (CR) Jbond: "thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/684437 (owner: Jbond) [09:04:41] !log rolling restart of cassandra in codfw to pick up Java security updates [09:04:44] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:47] (PS3) David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) [09:08:49] (PS2) David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - https://gerrit.wikimedia.org/r/683888 (https://phabricator.wikimedia.org/T280641) [09:08:51] (PS4) David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - https://gerrit.wikimedia.org/r/683371 (https://phabricator.wikimedia.org/T280641) [09:08:53] (PS1) David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL
[cookbooks] - https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) [09:10:34] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (Silvan_WMDE) Sorry, I don't have kerberos credentials yet - is it possible to get them, too? Thx. [09:11:57] (CR) jerkins-bot: [V: -1] wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: David Caro) [09:13:06] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/684437 (owner: Jbond) [09:14:15] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (Addshore) >>! In T280541#7056248, @Silvan_WMDE wrote: > Sorry, I don't have kerberos credentials yet - is it possible to get them, too? Thx. I believe https://wikitech.... [09:21:38] (PS1) Giuseppe Lavagetto: kubernetes::global_config: add ipv6 for kafka [puppet] - https://gerrit.wikimedia.org/r/684814 [09:23:11] (CR) jerkins-bot: [V: -1] kubernetes::global_config: add ipv6 for kafka [puppet] - https://gerrit.wikimedia.org/r/684814 (owner: Giuseppe Lavagetto) [09:30:02] (CR) Volans: "I've added some context info that might be useful inline."
(3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: David Caro) [09:31:45] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.50 [software/spicerack] - https://gerrit.wikimedia.org/r/684809 (owner: Volans) [09:37:15] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.50 [software/spicerack] - https://gerrit.wikimedia.org/r/684809 (owner: Volans) [09:37:30] (PS1) Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - https://gerrit.wikimedia.org/r/684819 [09:39:56] (CR) jerkins-bot: [V: -1] Add a cookbook to delete hosts from debmonitor [cookbooks] - https://gerrit.wikimedia.org/r/684819 (owner: Muehlenhoff) [09:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P15703 and previous config saved to /var/cache/conftool/dbconfig/20210504-094138-root.json [09:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:36] (PS3) Arturo Borrero Gonzalez: openstack: neutron: topology changes for cloudgw [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) [09:45:35] !log +50G for prometheus k8s in codfw [09:45:41] (CR) Arturo Borrero Gonzalez: openstack: neutron: topology changes for cloudgw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [09:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:18] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (Silvan_WMDE) >>! In T280541#7056268, @Addshore wrote: > I believe https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos wo...
[09:50:11] (Abandoned) Kosta Harlan: Echo: Enable poll for updates feature on testwiki and mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/530639 (https://phabricator.wikimedia.org/T219222) (owner: Kosta Harlan) [09:54:08] (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) [09:54:46] (PS3) Jbond: P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 [09:54:53] (PS4) Jbond: P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 [09:55:02] (PS2) Jbond: P:gitlab: manage gitlab with gitlab module [puppet] - https://gerrit.wikimedia.org/r/684487 [09:55:04] (CR) Zoranzoki21: [C: -1] "Tabs should be used instead of spaces for indentation." [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [09:55:17] (CR) jerkins-bot: [V: -1] GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [09:55:28] (PS2) Jbond: P:gitlab: add ability to manage gitlab sshd instance [puppet] - https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) [09:55:37] (PS2) Jbond: O:gitlab: manage sshd config [puppet] - https://gerrit.wikimedia.org/r/684439 (https://phabricator.wikimedia.org/T276148) [09:55:39] (CR) jerkins-bot: [V: -1] P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 (owner: Jbond) [09:56:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P15704 and previous config saved to /var/cache/conftool/dbconfig/20210504-095642-root.json [09:56:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:06] (CR) jerkins-bot: [V: -1] P:gitlab: manage gitlab with gitlab module [puppet] - https://gerrit.wikimedia.org/r/684487 (owner: Jbond) [09:57:18] !log installing bind9 security updates on buster (client side tools/libs only) [09:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:25] (CR) jerkins-bot: [V: -1] P:gitlab: add ability to manage gitlab sshd instance [puppet] - https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) (owner: Jbond) [09:57:44] (PS2) JMeybohm: prometheus: Clean up absent file resource [puppet] - https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) [09:57:46] (PS1) JMeybohm: puppet_compiler: Remove etcd and conftool::client [puppet] - https://gerrit.wikimedia.org/r/684848 (https://phabricator.wikimedia.org/T271573) [09:58:12] (PS2) Kosta Harlan: GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) [09:59:00] (CR) Giuseppe Lavagetto: [C: +1] puppet_compiler: Remove etcd and conftool::client [puppet] - https://gerrit.wikimedia.org/r/684848 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [09:59:51] (PS1) Volans: Upstream release v0.0.50 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/684849 [09:59:53] (CR) Zoranzoki21: "This is okay now.
😎" [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [10:00:15] (CR) Muehlenhoff: [C: -1] "Will adapt to class API" [cookbooks] - https://gerrit.wikimedia.org/r/684819 (owner: Muehlenhoff) [10:02:40] (CR) David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: David Caro) [10:03:08] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/684799 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [10:04:12] (PS5) Jbond: P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 [10:04:54] (CR) jerkins-bot: [V: -1] P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 (owner: Jbond) [10:09:12] (CR) Hashar: "That is really specific to our CI system, we don't have a HOME directory in the containers.
Upstreamed as:" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684440 (owner: Hashar) [10:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P15705 and previous config saved to /var/cache/conftool/dbconfig/20210504-101145-root.json [10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:16] (CR) Volans: [C: +2] Upstream release v0.0.50 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/684849 (owner: Volans) [10:14:49] (CR) JMeybohm: rdf-streaming-updater: enable HA capability (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: Mstyles) [10:14:51] (PS1) Kosta Harlan: Rename variant control group to 'control' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684829 (https://phabricator.wikimedia.org/T281727) [10:14:53] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29368/console" [puppet] - https://gerrit.wikimedia.org/r/684848 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:15:17] (PS1) Kosta Harlan: Rename variant control group to 'control' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/684830 (https://phabricator.wikimedia.org/T281727) [10:16:25] (PS2) Giuseppe Lavagetto: kubernetes::global_config: add ipv6 for kafka [puppet] - https://gerrit.wikimedia.org/r/684814 [10:17:26] (PS1) Urbanecm: Enable Growth team features in dark mode on bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/684851 (https://phabricator.wikimedia.org/T280824) [10:17:59] (CR) jerkins-bot: [V: -1] kubernetes::global_config: add ipv6 for kafka [puppet] - https://gerrit.wikimedia.org/r/684814 (owner: Giuseppe Lavagetto)
[10:18:01] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29369/console" [puppet] - https://gerrit.wikimedia.org/r/684814 (owner: Giuseppe Lavagetto) [10:19:11] (Merged) jenkins-bot: Upstream release v0.0.50 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/684849 (owner: Volans) [10:21:45] <_joe_> jayme: please merge your patch so that I can re-run ci and merge mine :P [10:21:47] (CR) Arturo Borrero Gonzalez: [C: +1] "I like the idea!" [cookbooks] - https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: David Caro) [10:26:07] (CR) Muehlenhoff: [C: +2] install_server: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/671088 (owner: Muehlenhoff) [10:26:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P15707 and previous config saved to /var/cache/conftool/dbconfig/20210504-102649-root.json [10:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:26] <_joe_> jbond42: any reason not to merge jayme's patch?
it's basically blocking CI for any profile [10:31:51] _joe_: i think it's fine to merge; if we have issues we can revert that and the one to remove etcd and investigate further [10:32:18] was chatting to jayme in pm, think they are merging now [10:32:31] will do [10:32:36] (CR) Jbond: [V: +1 C: +1] puppet_compiler: Remove etcd and conftool::client [puppet] - https://gerrit.wikimedia.org/r/684848 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:32:48] (CR) JMeybohm: [C: +2] puppet_compiler: Remove etcd and conftool::client [puppet] - https://gerrit.wikimedia.org/r/684848 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm) [10:35:37] jayme: _joe_: puppet's running fine on the compilers now [10:36:55] jbond42: cool. Thanks for checking! [10:37:27] np [10:37:28] <_joe_> yeah the risk was actually that if anyone wanted to connect to etcd, it would fail [10:37:40] <_joe_> it's still possible it will happen if you actually stop etcd [10:38:06] should we then do so to figure out? [10:38:29] <_joe_> let me check one thing first [10:40:01] <_joe_> yeah I can confirm it's not needed [10:40:20] <_joe_> we mock /etc/conftool-state/mediawiki.yaml in the compiler class [10:41:17] (CR) Giuseppe Lavagetto: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/684814 (owner: Giuseppe Lavagetto) [10:41:41] ok.
I'll stop etcd on all compiler100[1-3] then (cc jbond42) [10:42:00] ack if you stop it jayme ill purge them tomorrow [10:42:13] okay [10:43:13] done [10:43:17] thanks [10:44:30] (CR) Giuseppe Lavagetto: [V: +1 C: +2] kubernetes::global_config: add ipv6 for kafka [puppet] - https://gerrit.wikimedia.org/r/684814 (owner: Giuseppe Lavagetto) [10:48:41] (PS1) Giuseppe Lavagetto: kafka egress: consume CIDRs from configuration [deployment-charts] - https://gerrit.wikimedia.org/r/684854 [10:48:43] (PS1) Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [10:48:45] (PS1) Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/684856 [10:49:54] (CR) jerkins-bot: [V: -1] eventgate: add kafka egress policy stanza [deployment-charts] - https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: Giuseppe Lavagetto) [10:49:56] (CR) jerkins-bot: [V: -1] eventgate-main: autogenerate egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/684856 (owner: Giuseppe Lavagetto) [10:52:19] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:53:39] (PS6) Jbond: P:gitlab: add basic gitlab class [puppet] - https://gerrit.wikimedia.org/r/684486 [10:54:35] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:54:37] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:54:46] (PS2) Hashar: [WMF] Add
XDG_CACHE_HOME to tools/download_file.py [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684440 [10:54:48] (PS3) Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 [10:54:51] (PS4) Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684411 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1100). Please do the needful. [11:00:04] kostajh, Urbanecm, and Nikerabbit: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] here [11:00:12] I can deploy today [11:00:42] (CR) Urbanecm: [C: +2] Rename variant control group to 'control' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684829 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [11:00:44] (CR) Urbanecm: [C: +2] Rename variant control group to 'control' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/684830 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [11:01:00] (CR) Urbanecm: [C: +2] GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [11:01:23] (PS7) Urbanecm: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473) (owner: 01miki10) [11:01:28] (CR) Urbanecm: [C: +2] Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473) (owner: 01miki10) [11:01:29] \o [11:02:01] Urbanecm: Hi! Sorry I know the calendar is a bit full. Do you think we'd have time to add a config change at the end? [11:02:14] jan_drewniak: likely :). Add it to the calendar and we'll see :) [11:02:45] (Merged) jenkins-bot: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473) (owner: 01miki10) [11:03:15] kostajh: hi Kosta, quick question about your patches: we'll have to deploy either the backport first and config later, or the other way around. What will happen in the (short) while when backport is there, but config not? [11:03:43] Nikerabbit: your patch is available on mwdebug1001, please test. [11:04:08] Urbanecm: ay ay [11:04:50] looks good to me [11:05:17] thanks, syncing [11:06:10] Urbanecm: how long of a window are we talking about? [11:06:26] kostajh: two/three minutes max [11:06:39] Urbanecm: I suppose what I should have done is left 'null' as a valid variant. I could make another patch [11:06:48] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8228f6beacd2f7e94a65f32d41f558c0f440db0a: Disable ContentTranslation New article campaign in fiwiki (T277473) (duration: 00m 59s) [11:06:54] Nikerabbit: deployed [11:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:56] T277473: Disable ContentTranslation New article campaign in fiwiki - https://phabricator.wikimedia.org/T277473 [11:07:04] kostajh: yeah, i should've probably realized that sooner. Can you do that please? [11:07:30] Urbanecm: do you want me to revert the one that was merged already? Or make a new patch on top of HEAD?
[11:07:34] Urbanecm: thanks [11:08:01] kostajh: i think we can do a new patch on top of HEAD [11:08:14] np Nikerabbit [11:10:29] !log Create growthexperiments_link_recommendations and growthexperiments_link_submissions on arwiki,bnwiki,viwiki x1 (T266913) [11:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:37] T266913: Add a link engineering: create tables in Wikimedia production - https://phabricator.wikimedia.org/T266913 [11:11:18] Urbanecm: done in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684857 [11:12:05] kostajh: this is basically fine to be reverted once the config is updated, right? [11:12:11] Urbanecm: yes [11:12:31] okay. I'll merge it to wmf branches only then, and only +1 the patch to record approval. [11:14:14] done as https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684832 [11:14:52] Urbanecm: thanks. I guess we don't need it for wmf.4 either [11:15:05] yeah, only wmf.3 done - no other version is live now. [11:15:39] kostajh: and for add a link, creating tables is the only requirement, right? [11:15:49] Urbanecm: that's right [11:15:58] great. Merging the patch. 
[11:16:03] (CR) Urbanecm: [C: +2] GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [11:16:09] (PS3) Urbanecm: GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [11:16:15] (CR) Urbanecm: [C: +2] GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [11:17:01] (Merged) jenkins-bot: GrowthExperiments: Enable link recommendations for target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/684825 (https://phabricator.wikimedia.org/T278710) (owner: Kosta Harlan) [11:18:18] kostajh: patch pulled onto mwdebug1001 if you want to have a look. [11:18:35] ok [11:19:13] Urbanecm: which one, the enable link recommendations one? [11:19:19] kostajh: yes [11:20:51] Urbanecm: are you able to add the link-recommendation task type to MediaWiki:NewcomerTasks.json for ar/bn/vi wiki? [11:21:02] oh, right, we need to do that as well. Sure, will do it. [11:22:21] Urbanecm: after the patch is synced though [11:22:28] Urbanecm: but yeah the patch looks good to me [11:22:35] oh, am i not supposed to do it before? [11:23:07] Urbanecm: ah, well actually I don't think it matters [11:23:26] okay [11:24:43] Done on all three. [11:24:56] kostajh: okay to sync now? [11:25:36] Urbanecm: yes please [11:25:40] okay, doing.
[11:26:03] (Merged) jenkins-bot: Temporarily re-add 'null' control group [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684832 (https://phabricator.wikimedia.org/T281727) (owner: Urbanecm) [11:27:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 87dff0b1abe588f0ddc62985fdb40b5ec0fa1bbd: GrowthExperiments: Enable link recommendations for target wikis (T278710) (duration: 00m 57s) [11:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:09] T278710: Add a link: production deployment - https://phabricator.wikimedia.org/T278710 [11:27:32] kostajh: should be live. [11:28:19] Urbanecm: nice. So, next up is the control group GrowthExperiments patch + mediawiki-config patch? [11:28:23] yup [11:28:49] (PS2) Urbanecm: GrowthExperiments: Rename control variant to 'control' [mediawiki-config] - https://gerrit.wikimedia.org/r/684799 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [11:28:53] (CR) Urbanecm: [C: +2] GrowthExperiments: Rename control variant to 'control' [mediawiki-config] - https://gerrit.wikimedia.org/r/684799 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [11:29:12] (CR) Arturo Borrero Gonzalez: [C: -1] "DON'T MERGE THIS" [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [11:29:29] (CR) Arturo Borrero Gonzalez: [C: -1] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29371/" [puppet] - https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [11:29:41] (Merged) jenkins-bot: GrowthExperiments: Rename control variant to 'control' [mediawiki-config] - https://gerrit.wikimedia.org/r/684799 (https://phabricator.wikimedia.org/T281727) (owner: Kosta Harlan) [11:30:26] kostajh: both should be on mwdebug1001 if you want to take a look.
[11:30:43] Urbanecm: ok [11:31:15] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: add cloudsw addresses in vlan 1120 [dns] - 10https://gerrit.wikimedia.org/r/684353 (https://phabricator.wikimedia.org/T270704) [11:31:40] !log Run `User::newSystemUser( 'Maintenance script', [ 'steal' => true ] );` on arwiki, bnwiki, viwiki (T278710, T281703) [11:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:49] T281703: TypeError: Argument 1 passed to GrowthExperiments\NewcomerTasks\TaskSuggester\CacheDecorator::suggest() must implement interface MediaWiki\User\UserIdentity, null given, called in /srv/mediawiki/php-1.37.0-wmf.3/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php on line 170 - https://phabricator.wikimedia.org/T281703 [11:32:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: add cloudsw addresses in vlan 1120 [dns] - 10https://gerrit.wikimedia.org/r/684353 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [11:32:32] (03PS2) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [11:33:05] Urbanecm: hmm, I see the link recommendation task type with a newly created user account. On mwdebug1001 [11:33:24] that sounds like...something that shouldn't happen [11:33:49] should i revert? [11:33:56] Urbanecm: just a sec [11:34:01] sure, waiting. [11:35:13] (03CR) 10jerkins-bot: [V: 04-1] Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [11:39:30] Urbanecm: can you double check the value of `GEHomepageDefaultVariant` on the mwdebug1001 host please? 
[11:40:01] kostajh: it's "control" on cswiki https://www.irccloud.com/pastebin/SZq3S6sy/ [11:40:06] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: update names for cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/684864 (https://phabricator.wikimedia.org/T270704) [11:40:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [11:40:40] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DON'T MERGE THIS. This patch requires a coordinated operation." [dns] - 10https://gerrit.wikimedia.org/r/684864 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] and it's also `control` for viwiki, bnwiki and arwiki [11:41:24] Urbanecm: ugh, I think I missed something. `GEHomepageNewAccountVariants` should list `control` but it just has `null`. [11:41:47] ah, that might explain it. Can you upload a patch for that please? [11:41:58] yep [11:42:01] thanks [11:42:34] Urbanecm: in a chain on top of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684857 ? or separate? [11:42:54] it can go separately now that the patches are merged [11:43:15] (i mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684832, the wmf.3 version) [11:46:47] Urbanecm: actually it seems like a separate problem. [11:47:01] :( [11:47:04] https://www.irccloud.com/pastebin/LAuP6rJY/ [11:47:11] that's on mwdebug1001 [11:47:14] that...doesn't look right [11:47:26] I don't understand why that is happening [11:49:25] kostajh: linkrecommendation is 100 on prod as well. Are we sure we don't accidentally serve linkrecommendation to users? https://www.irccloud.com/pastebin/LdZDaIKn/ [11:49:47] Urbanecm: ah... hold on [11:49:58] sure, waiting :) [11:50:34] Urbanecm: do we need merge_strategy=array_plus_2d for this config in extension.json? [11:51:50] kostajh: honestly I'm not sure. 
https://www.mediawiki.org/wiki/Manual:Extension.json/Schema#Merge_strategies says that's for arrays with depth of 2, and that array_plus is for numeric indexes. We don't meet any of those [11:52:04] Urbanecm: for now I might just update extension.json to set "linkrecommendation": 0 for $wgGEHomepageNewAccountVariants [11:54:01] Urbanecm: done in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684869 [11:54:22] kostajh: I just tried that live on mwdebug1001, and it results in this [11:54:25] https://www.irccloud.com/pastebin/E37MEFnC/ [11:54:36] which also doesn't sound right. I'm inclined to revert and debug later, if possible. [11:55:03] that is what we want -- we don't want any new accounts opted in to the linkrecommendation variant yet [11:56:20] actually, that happened when I set it to {} in extension.json (an empty array) [11:56:26] with your patch, it's this [11:56:29] https://www.irccloud.com/pastebin/YkRBWzhJ/ [11:56:34] lol [11:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 to upgrade its mysql T281212', diff saved to https://phabricator.wikimedia.org/P15710 and previous config saved to /var/cache/conftool/dbconfig/20210504-115634-marostegui.json [11:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:44] T281212: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 [11:56:51] * Urbanecm is confused [11:57:41] actually, i might know a simpler solution, wait a sec [11:58:40] or also maybe not. [11:58:42] !log Upgrade mysql and kernel on db1120 T281212 [11:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:05] kostajh: so, what do you think, do we set it to `{}` in extension.json now, or revert all the things? 
[11:59:06] (03PS5) 10Hnowlan: eventlogging: remove mariadb profile and create log dir [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) [11:59:18] Urbanecm: can you check with array_plus_2d ? [11:59:47] (03PS1) 10Jcrespo: dbbackups: Add s3 to db1102 and s2 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/684873 (https://phabricator.wikimedia.org/T280979) [11:59:59] (03PS2) 10Jcrespo: dbbackups: Add s3 to db1102 and s2 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/684873 (https://phabricator.wikimedia.org/T280979) [12:00:12] kostajh: looks the same as the original one https://www.irccloud.com/pastebin/LSVutgfg/ [12:00:47] Urbanecm: ok. Let me try one other thing... Reverting everything will be a pain. [12:00:54] But I recognize it has been a painful hour here already :) [12:01:12] yeah, more difficult than i would've expected [12:01:36] kostajh: for the record, you can edit files with `sudo -u mwdeploy vim /srv/mediawiki/php-1.37.0-wmf.3/extensions/GrowthExperiments/extension.json` on mwdebug1001 if you wanna try there. [12:02:30] (03PS3) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [12:03:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repool db1120 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15711 and previous config saved to /var/cache/conftool/dbconfig/20210504-120337-root.json [12:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:34] https://www.irccloud.com/pastebin/3DFuq9uA/ [12:04:54] https://www.irccloud.com/pastebin/aHjjVi0f/ [12:05:11] that looks OK? [12:05:19] why did i get the 200 there before... [12:05:25] yeah, that looks good. [12:05:30] can you maybe try with a new account too? 
[12:05:36] (03CR) 10jerkins-bot: [V: 04-1] Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [12:06:05] Urbanecm: yeah. So, i think the issue is that `"control": 100` is missing from GrowthExperiments [12:06:09] Urbanecm: will test with new account [12:06:24] thanks [12:07:28] confirmed i got the same result on mwdebug1002, i must've made a silly mistake when editing it [12:07:31] Urbanecm: can confirm that the new account is in `control` [12:07:36] great. [12:07:49] so the plan is to merge & backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/684869 and sync it all? [12:08:19] sorry jan_drewniak, Kosta's patches turned out to be much more complicated than they were supposed to be :( [12:08:34] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [12:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] Urbanecm: yeah no problem, I'll schedule another window [12:09:08] thanks! [12:09:11] kostajh: could you please confirm the plan i posted few lines above? [12:09:12] Urbanecm: ah, in mediawiki-config we should be setting 'linkrecommendation' to 0 [12:09:53] Urbanecm: I think an easier fix might be to set linkrecommendation => 0 in mediawiki-config [12:10:00] let's try that. [12:10:01] Urbanecm: but we can do both [12:10:12] (I'm on mwdebug1002 now, so i don't tamper with mwdebug1001 while you're there) [12:10:31] Urbanecm: ok. Can you test making that change on mediawiki-config on mwdebug1002 or do you need a patch? [12:10:39] I'm testing it there [12:13:01] okay, sounds to work https://www.irccloud.com/pastebin/Q6iCdXsm/ [12:13:06] (03PS1) 10Urbanecm: GrowthExperiments: Set linkrecommendation variant to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684878 (https://phabricator.wikimedia.org/T281727) [12:13:15] kostajh: can you check the patch ^^? 
[12:13:23] Urbanecm: looking [12:13:55] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Set linkrecommendation variant to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684878 (https://phabricator.wikimedia.org/T281727) (owner: 10Urbanecm) [12:14:00] thanks [12:14:19] 😅 that was fun [12:14:43] (03Merged) 10jenkins-bot: GrowthExperiments: Set linkrecommendation variant to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684878 (https://phabricator.wikimedia.org/T281727) (owner: 10Urbanecm) [12:14:48] (03PS4) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [12:14:50] yeah :). This stuff happens regularly, to remind you that regardless of how it looks, deployment is not an easy job :D [12:15:41] Urbanecm: yep [12:15:58] so, pulled the patch to mwdebug1001, and this is what i see [12:16:00] https://www.irccloud.com/pastebin/ZLGIM6OM/ [12:16:09] kostajh: ok to sync? [12:16:44] Urbanecm: let me check that I reverted my extension.json change there [12:16:49] i ran scap pull [12:17:02] that downloaded everything from deploy1001 [12:17:04] (03PS3) 10Hnowlan: api-gateway: Create individual cluster definitions for read and write [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) [12:17:14] Urbanecm: got it. Shall I test making a new account just to be sure? [12:17:25] please do kostajh [12:17:31] ack [12:17:33] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29373/console" [puppet] - 10https://gerrit.wikimedia.org/r/683831 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [12:18:21] Urbanecm: it works :D [12:18:33] excellent. Let's sync it then! 
[12:18:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repool db1120 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15712 and previous config saved to /var/cache/conftool/dbconfig/20210504-121841-root.json [12:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:55] please do [12:20:49] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GrowthExperiments/: 8f938c2: c8c07ab: GrowthExperiments backports (T281727) (duration: 00m 59s) [12:20:55] backport done, doing config [12:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:58] T281727: Rename GrowthExperiments default user variant to 'control' - https://phabricator.wikimedia.org/T281727 [12:21:37] (03CR) 10jerkins-bot: [V: 04-1] Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [12:22:48] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 683b876: 5763630: GrowthExperiments: Rename control variant to control, GrowthExperiments: Set linkrecommendation variant to 0 (T281727) (duration: 00m 58s) [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] kostajh: should be live. [12:23:35] Urbanecm: great, verifying [12:24:02] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Create individual cluster definitions for read and write [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [12:24:35] Urbanecm: looks good [12:24:43] nice! 
[12:25:15] I'm still a little uncertain about whether we opted new users into linkrecommendation variant yesterday after enabling on cswiki, but will have a look [12:25:31] (03Merged) 10jenkins-bot: api-gateway: Create individual cluster definitions for read and write [deployment-charts] - 10https://gerrit.wikimedia.org/r/682921 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [12:25:33] I didn't see any edits tagged at the very least :) [12:27:11] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:27:11] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [12:27:14] Urbanecm: looks like just 6 users [12:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:32] 10SRE: Add PKI root CA to ca-certificates via puppet - https://phabricator.wikimedia.org/T281376 (10jbond) This has been completed [12:28:34] yes, looks so (one of them is me with the JS snippet) [12:28:37] (03PS5) 10Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 [12:28:41] 10SRE: Add PKI root CA to ca-certificates via puppet - https://phabricator.wikimedia.org/T281376 (10jbond) 05Open→03Resolved [12:28:43] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [12:28:44] so we did, but not for a lot of users. 
[12:29:35] (03PS1) 10Jbond: cfssl: add revoke functionality to cfssl_cert [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) [12:30:17] (03CR) 10jerkins-bot: [V: 04-1] cfssl: add revoke functionality to cfssl_cert [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) [12:30:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29374/console" [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) [12:33:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repool db1120 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15713 and previous config saved to /var/cache/conftool/dbconfig/20210504-123344-root.json [12:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751 [12:34:39] (03PS2) 10Jbond: cfssl: add revoke functionality to cfssl_cert [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) [12:34:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751 [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:44] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [12:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29375/console" [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) 
[12:35:38] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling for sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15714 and previous config saved to /var/cache/conftool/dbconfig/20210504-123537-kormat.json [12:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:55] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [12:37:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl: add revoke functionality to cfssl_cert [puppet] - 10https://gerrit.wikimedia.org/r/684884 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) [12:43:27] (03PS1) 10Jbond: cfssl-certs: fix nargs argument [puppet] - 10https://gerrit.wikimedia.org/r/684891 [12:44:17] (03CR) 10Jbond: [C: 03+2] cfssl-certs: fix nargs argument [puppet] - 10https://gerrit.wikimedia.org/r/684891 (owner: 10Jbond) [12:46:48] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15715 and previous config saved to /var/cache/conftool/dbconfig/20210504-124647-kormat.json [12:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:57] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [12:48:41] (03PS1) 10Jbond: cfssl-certs: remove short option for dp-config [puppet] - 10https://gerrit.wikimedia.org/r/684894 [12:48:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repool db1120 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15716 and previous config saved to /var/cache/conftool/dbconfig/20210504-124848-root.json [12:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:27] (03PS1) 10Kormat: mariadb: Remove obsolete comment. 
[puppet] - 10https://gerrit.wikimedia.org/r/684895 (https://phabricator.wikimedia.org/T280751) [12:49:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 to upgrade its mysql T281212', diff saved to https://phabricator.wikimedia.org/P15717 and previous config saved to /var/cache/conftool/dbconfig/20210504-124937-marostegui.json [12:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:46] T281212: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 [12:50:18] !log Upgrade mysql and kernel on db1137 T281212 [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:40] (03CR) 10Kormat: [C: 03+2] mariadb: Remove obsolete comment. [puppet] - 10https://gerrit.wikimedia.org/r/684895 (https://phabricator.wikimedia.org/T280751) (owner: 10Kormat) [12:52:48] (03CR) 10Jbond: [C: 03+2] cfssl-certs: remove short option for dp-config [puppet] - 10https://gerrit.wikimedia.org/r/684894 (owner: 10Jbond) [12:54:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repool db1137 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15718 and previous config saved to /var/cache/conftool/dbconfig/20210504-125439-root.json [12:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:42] !log installing debian-archive-keyring updates on buster [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repool db1137 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15719 and previous config saved to /var/cache/conftool/dbconfig/20210504-130943-root.json [13:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:13] (03PS1) 10Jbond: cfssl-cert: only create a db_connection if we need to [puppet] - 10https://gerrit.wikimedia.org/r/684899 
[13:11:57] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [13:12:19] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [13:12:35] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:12:35] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] (03CR) 10Jbond: [C: 03+2] cfssl-cert: only create a db_connection if we need to [puppet] - 10https://gerrit.wikimedia.org/r/684899 (owner: 10Jbond) [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:49] !log upgrading linux-libc-dev on buster hosts (to version introduced by 10.9 point release) [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:07] (03PS2) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [13:15:09] (03PS2) 10Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 [13:15:21] (03PS1) 10Jbond: cfssl-cert: use read_byts not read_text [puppet] - 10https://gerrit.wikimedia.org/r/684900 [13:16:04] (03CR) 10Jbond: [C: 03+2] cfssl-cert: use read_byts not read_text [puppet] - 10https://gerrit.wikimedia.org/r/684900 (owner: 10Jbond) [13:19:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kafka egress: consume CIDRs from configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/684854 (owner: 10Giuseppe Lavagetto) [13:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repool db1137 after mysql upgrade', diff saved to 
https://phabricator.wikimedia.org/P15720 and previous config saved to /var/cache/conftool/dbconfig/20210504-132446-root.json [13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:21] jouncebot: next [13:27:21] In 2 hour(s) and 32 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1600) [13:27:44] jouncebot: now [13:27:45] No deployments scheduled for the next 2 hour(s) and 32 minute(s) [13:32:00] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [13:34:17] (03PS1) 10Jbond: cfssl-certs: Fix script to use correct akid [puppet] - 10https://gerrit.wikimedia.org/r/684926 (https://phabricator.wikimedia.org/T281366) [13:34:59] (03PS1) 10Hnowlan: api-gateway: route rw traffic to rw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/684929 (https://phabricator.wikimedia.org/T277585) [13:35:08] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [13:35:14] (03CR) 10Jbond: [C: 03+2] cfssl-certs: Fix script to use correct akid [puppet] - 10https://gerrit.wikimedia.org/r/684926 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) [13:38:38] (03PS1) 10Jbond: pki: remove old (now revoked) intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/684930 (https://phabricator.wikimedia.org/T281366) [13:39:43] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [13:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repool db1137 after mysql upgrade', diff saved to https://phabricator.wikimedia.org/P15721 and previous config saved to /var/cache/conftool/dbconfig/20210504-133950-root.json [13:39:55] (03CR) 10Jbond: [C: 03+2] pki: remove old (now revoked) intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/684930 (https://phabricator.wikimedia.org/T281366) (owner: 10Jbond) 
[13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] 10SRE, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10MoritzMuehlenhoff) p:05Medium→03High [13:46:43] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-debmonitor_discovery_wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:59] !log installing exim security updates on buster [13:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:30] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocspserve@debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:45] jbond42: ^^^ FYI [13:54:15] volans: ack ill clean up thanks volans [13:54:22] np [13:56:14] (03PS1) 10Jbond: O:pki::root: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684939 [13:58:18] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:18] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:33] (03CR) 10Jbond: [C: 03+2] O:pki::root: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684939 (owner: 10Jbond) [14:06:03] (03PS1) 10Jbond: O:pki::multirootca: Add discover CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) [14:06:38] Hi all! [14:07:04] I'm about to do some live debugging on mwdebug1001, together with @Amir1. Should be done in an hour or two. 
[14:07:34] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: Add discover CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) (owner: 10Jbond) [14:08:06] I'll be messing with files, so please don't scap pull :) [14:11:06] (03PS2) 10Jbond: O:pki::multirootca: Add discover CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) [14:15:06] 10SRE, 10Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (10Dzahn) Seems like the "define what archived means" is T281778. As close as it gets to the commands that were used on mailman2 for the same thing? [14:15:46] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add s3 to db1102 and s2 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/684873 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [14:16:15] 10SRE, 10Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (10Dzahn) German Wikinews response is they are asking for an archive of "pressemeldungen". 
[14:20:56] (03PS3) 10Jbond: O:pki::multirootca: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) [14:23:04] (03CR) 10Jbond: [C: 03+2] O:pki::multirootca: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) (owner: 10Jbond) [14:24:14] (03PS4) 10Jbond: O:pki::multirootca: Add discover CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) [14:25:46] (03PS5) 10Jbond: O:pki::multirootca: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) [14:25:48] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) (owner: 10Jbond) [14:26:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29380/console" [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) (owner: 10Jbond) [14:27:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:pki::multirootca: Add discovery CA [puppet] - 10https://gerrit.wikimedia.org/r/684946 (https://phabricator.wikimedia.org/T281370) (owner: 10Jbond) [14:32:15] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-debmonitor_discovery_wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:21] looking [14:33:27] (03PS1) 10Jbond: P:pki::multirootca: add dependency when creating ocsp [puppet] - 10https://gerrit.wikimedia.org/r/684957 [14:34:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29381/console" [puppet] - 10https://gerrit.wikimedia.org/r/684957 (owner: 10Jbond) [14:34:51] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - 
degraded: The following units failed: cfssl-ocsprefresh-debmonitor_discovery_wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:15] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: add dependency when creating ocsp [puppet] - 10https://gerrit.wikimedia.org/r/684957 (owner: 10Jbond) [14:37:51] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [14:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:35] (03PS1) 10Volans: dnsdisc: do not configure DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/684958 [14:48:37] (03PS1) 10Volans: tests: fix DNS mock [software/spicerack] - 10https://gerrit.wikimedia.org/r/684959 [14:53:12] PROBLEM - Disk space on lists1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops [14:53:52] ^ looking [14:55:06] 10ops-eqiad, 10DC-Ops: hw troubleshooting: for - https://phabricator.wikimedia.org/T281881 (10nskaggs) [14:56:14] 10ops-eqiad, 10DC-Ops: hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10nskaggs) [14:58:00] PROBLEM - Disk space on phab1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops 
[14:59:14] RECOVERY - Disk space on lists1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=lists1001&var-datasource=eqiad+prometheus/ops [14:59:38] (03CR) 10Volans: [C: 04-1] "Nice! Looks mostly ok, I've one main concern on the hiding of the errors part, see inline for details." (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/684819 (owner: 10Muehlenhoff) [15:01:22] RECOVERY - Disk space on phab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops [15:02:52] (03CR) 10Jbond: [C: 03+1] "good catch lgtm thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/684958 (owner: 10Volans) [15:06:24] (03PS1) 10David Caro: wmcs: use yaml vs json for k8s objects [cookbooks] - 10https://gerrit.wikimedia.org/r/684964 (https://phabricator.wikimedia.org/T281508) [15:06:26] (03CR) 10Volans: [C: 03+2] dnsdisc: do not configure DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/684958 (owner: 10Volans) [15:06:56] (03CR) 10Volans: [C: 03+2] tests: fix DNS mock [software/spicerack] - 10https://gerrit.wikimedia.org/r/684959 (owner: 10Volans) [15:07:44] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10nskaggs) [15:09:59] (03PS1) 10Hashar: ci: add docker0 IP to /etc/hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) [15:11:29] (03CR) 10jerkins-bot: [V: 04-1] ci: add docker0 IP to /etc/hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) (owner: 10Hashar) [15:11:48] (03PS1) 10Jbond: cfssl-certs: add functionality to delete expired certificates [puppet] - 
10https://gerrit.wikimedia.org/r/684966 [15:12:30] (03Merged) 10jenkins-bot: dnsdisc: do not configure DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/684958 (owner: 10Volans) [15:12:32] (03Merged) 10jenkins-bot: tests: fix DNS mock [software/spicerack] - 10https://gerrit.wikimedia.org/r/684959 (owner: 10Volans) [15:12:38] (03CR) 10Jbond: [C: 03+2] cfssl-certs: add functionality to delete expired certificates [puppet] - 10https://gerrit.wikimedia.org/r/684966 (owner: 10Jbond) [15:16:09] Ok, I'm done messing with mwdebug1001. [15:16:38] (03PS1) 10Jbond: cfssl-cert: actully commit the delete [puppet] - 10https://gerrit.wikimedia.org/r/684968 [15:18:36] (03CR) 10Jbond: [C: 03+2] cfssl-cert: actully commit the delete [puppet] - 10https://gerrit.wikimedia.org/r/684968 (owner: 10Jbond) [15:19:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:23:49] (03CR) 10Ppchelko: [C: 03+1] api-gateway: route rw traffic to rw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/684929 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [15:26:04] (03PS10) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [15:26:16] (03CR) 10Legoktm: Add shellbox chart (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [15:26:49] (03CR) 10Dzahn: [C: 03+2] Add miscweb namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/683743 (owner: 10Dzahn) [15:27:07] (03CR) 10jerkins-bot: [V: 04-1] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [15:29:25] (03PS1) 10Cwhite: profile: restore rsyslog-udp-localhost inputs on legacy logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/684969 (https://phabricator.wikimedia.org/T280805) [15:30:55] (03CR) 10jerkins-bot: [V: 04-1] profile: restore rsyslog-udp-localhost inputs on legacy logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/684969 (https://phabricator.wikimedia.org/T280805) (owner: 10Cwhite) [15:32:21] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.51 [software/spicerack] - 10https://gerrit.wikimedia.org/r/684971 [15:34:13] (03PS2) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) [15:43:00] (03PS2) 10Cwhite: profile: restore rsyslog-udp-localhost inputs on legacy logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/684969 (https://phabricator.wikimedia.org/T280805) [15:43:29] (03CR) 10Bstorm: [C: 03+1] "🎉" [cookbooks] - 10https://gerrit.wikimedia.org/r/684964 (https://phabricator.wikimedia.org/T281508) (owner: 10David Caro) [15:46:07] (03CR) 
10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.51 [software/spicerack] - 10https://gerrit.wikimedia.org/r/684971 (owner: 10Volans) [15:46:44] 10SRE: Revoke debmonitor.discovery.wmnet - https://phabricator.wikimedia.org/T281366 (10jbond) 05Open→03Resolved [15:46:47] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [15:48:00] (03PS1) 10Jbond: P:pki::multirootca: create a bool to control where cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) [15:48:55] (03PS32) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [15:48:57] (03CR) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [15:49:22] (03PS2) 10Jbond: P:pki::multirootca: create a bool to control where cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) [15:50:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29383/console" [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [15:52:15] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.51 [software/spicerack] - 10https://gerrit.wikimedia.org/r/684971 (owner: 10Volans) [15:54:58] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [15:55:39] (03PS1) 10Volans: Upstream release v0.0.51 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/684977 [15:56:59] (03CR) 10Hnowlan: [C: 03+2] 
api-gateway: route rw traffic to rw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/684929 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [15:58:29] (03Merged) 10jenkins-bot: api-gateway: route rw traffic to rw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/684929 (https://phabricator.wikimedia.org/T277585) (owner: 10Hnowlan) [15:59:16] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:59:17] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond42 and cdanis: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1600). [16:02:16] (03PS1) 10Arturo Borrero Gonzalez: hieradata: cloud: refresh mx-out server names [puppet] - 10https://gerrit.wikimedia.org/r/684981 [16:02:26] (03CR) 10jerkins-bot: [V: 04-1] Upstream release v0.0.51 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/684977 (owner: 10Volans) [16:02:59] (03CR) 10Volans: "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/684977 (owner: 10Volans) [16:03:14] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.51 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/684977 (owner: 10Volans) [16:03:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: cloud: refresh mx-out server names [puppet] - 10https://gerrit.wikimedia.org/r/684981 (owner: 10Arturo Borrero Gonzalez) [16:07:42] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[16:07:42] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:18] (03PS1) 10Bstorm: ceph alerts: fix hardcoded use of a single prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/684983 (https://phabricator.wikimedia.org/T281881) [16:12:00] !log dzahn@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:11] (03Merged) 10jenkins-bot: Upstream release v0.0.51 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/684977 (owner: 10Volans) [16:12:11] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:12:11] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:15] !log dzahn@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:41] !log k8s: upgrading release=namespaces, helmfile apply to create miscweb namespace T281538 [16:13:47] (03PS3) 10Jbond: P:pki::multirootca: create a bool to control where cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:49] T281538: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 [16:14:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29384/console" [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [16:15:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: create a bool to control where cron jobs run [puppet] - 10https://gerrit.wikimedia.org/r/684973 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [16:15:39] !log dzahn@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:18] !log dzahn@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:24] jouncebot: next [16:17:24] In 0 hour(s) and 42 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1700) [16:17:35] (03CR) 10David Caro: [C: 03+1] "+1 for changing the CNAME to cloud, though that can be done later." 
[puppet] - 10https://gerrit.wikimedia.org/r/684983 (https://phabricator.wikimedia.org/T281881) (owner: 10Bstorm) [16:23:58] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/684983 (https://phabricator.wikimedia.org/T281881) (owner: 10Bstorm) [16:25:02] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@e6ae572]: Increase convert_to_esbulk memory overhead [16:25:10] (03PS1) 10Jbond: cfssl-certs: also clean out ocsp responses [puppet] - 10https://gerrit.wikimedia.org/r/684985 (https://phabricator.wikimedia.org/T281369) [16:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:56] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@e6ae572]: Increase convert_to_esbulk memory overhead (duration: 01m 54s) [16:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:28] (03CR) 10Jbond: [C: 03+2] cfssl-certs: also clean out ocsp responses [puppet] - 10https://gerrit.wikimedia.org/r/684985 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [16:27:39] (03PS33) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [16:30:42] 10SRE: Create a discover CA - https://phabricator.wikimedia.org/T281370 (10jbond) 05Open→03Resolved [16:30:45] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [16:32:45] jouncebot: next [16:32:45] In 0 hour(s) and 27 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1700) [16:35:01] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [16:44:41] (03PS1) 10Urbanecm: Enable Growth features on 
enwiki in the dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684988 (https://phabricator.wikimedia.org/T281896) [16:45:58] 10SRE, 10Scap, 10Release-Engineering-Team (Next): Re-imaged mw app servers can end up with missing l10n cache for old versions of MW needed for rollback - https://phabricator.wikimedia.org/T273334 (10thcipriani) [16:50:54] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater: fix file permission [puppet] - 10https://gerrit.wikimedia.org/r/684989 [16:53:09] (03PS1) 10Bstorm: cloudmetrics: fail over to cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/684990 (https://phabricator.wikimedia.org/T281881) [17:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1700). [17:00:29] (03PS5) 10Bstorm: maintain-dbusers: rely on the UIDS, not username for all accounts [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) [17:00:37] !log uploaded spicerack_0.0.51 to apt.wikimedia.org bullseye-wikimedia [17:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:54] (03CR) 10Bstorm: [C: 04-1] "still blocked" [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [17:02:59] !log 1.37.0-wmf.4 was branched at f069fd8b5a6c817f4860fa68ae2f56b71a139f4a for T281145 [17:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:07] T281145: 1.37.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T281145 [17:07:34] 10SRE, 10Dumps-Generation, 10observability: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10hoo) [17:09:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 
10https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: 10David Caro) [17:09:24] (03PS1) 10Bstorm: wikireplica-dns: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/684999 (https://phabricator.wikimedia.org/T260389) [17:13:14] (03CR) 10Ebernhardson: [C: 03+1] "For the moment, the appropriate course of action seems to be to restore the functionality." [puppet] - 10https://gerrit.wikimedia.org/r/684969 (https://phabricator.wikimedia.org/T280805) (owner: 10Cwhite) [17:13:39] (03PS34) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [17:14:13] (03CR) 10Bstorm: [C: 03+2] wikireplica-dns: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/684999 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:15:25] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685000 [17:15:27] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685000 (owner: 10Ahmon Dancy) [17:16:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685000 (owner: 10Ahmon Dancy) [17:16:32] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.4 [17:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:35] 10SRE, 10Dumps-Generation, 10observability: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10dcausse) [17:18:42] (03PS1) 10Zabe: Add extendedconfirmed on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685002 (https://phabricator.wikimedia.org/T281860) [17:20:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudmetrics: fail over to cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/684990 
(https://phabricator.wikimedia.org/T281881) (owner: 10Bstorm) [17:22:27] 10SRE, 10Dumps-Generation, 10Wikidata, 10observability, 10wdwb-tech: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10Lydia_Pintscher) [17:23:18] 10SRE, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10Ladsgroup) This has a good comparison: https://mailtrap.io/blog/postfix-sendmail-exim/ It seems postfix is better in security/performance but has lower markets... [17:24:07] PROBLEM - ping-offload grafana alert on alert1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [17:24:39] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:24:45] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@0c4538f]: Increase convert_to_esbulk memory overhead [17:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:46] I think I know what the `Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002` alert is about. [17:25:49] I will investigate. [17:26:21] RECOVERY - ping-offload grafana alert on alert1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. 
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [17:26:32] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@0c4538f]: Increase convert_to_esbulk memory overhead (duration: 01m 46s) [17:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:03] (03PS3) 10Dzahn: ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) [17:27:18] (03CR) 10jerkins-bot: [V: 04-1] ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [17:28:55] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:32:22] (03PS2) 10Luke081515: Enable Wikidata description override on dewiki at beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682337 (https://phabricator.wikimedia.org/T279829) [17:32:44] 10SRE, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10herron) Some high level thoughts about how we might approach migrating: **Inbound mail:** As a first step in migrating to postfix we could front the existing e... 
[17:33:05] (03PS4) 10Dzahn: ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) [17:33:55] (03CR) 10jerkins-bot: [V: 04-1] ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [17:35:57] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:36:41] ^ manual editing? [17:36:44] self-healed? I don't know what happened. [17:37:11] It would be nice if the alert mentioned at least one of the offending files. [17:37:34] notable that it is codfw and not eqiad [17:37:50] aah, I did notice "codfw" in passing but it didn't fully register in my mind [17:38:46] I checked but it's too late, it's already fixed [17:39:00] no root-owned files now in staging [17:39:35] (03PS1) 10Bstorm: wikireplica-dns: Add the outlier CNAMES and correct fqdn [puppet] - 10https://gerrit.wikimedia.org/r/685012 (https://phabricator.wikimedia.org/T260389) [17:40:56] dancy: " somebody ran scripts as root when they should have used a deployment user. " ? [17:41:24] I guess we'll never know unless someone fesses up. 
[17:43:29] if it happens just once and not next time it's like it didn't happen [17:43:41] nod [17:45:51] (03CR) 10Majavah: [C: 03+1] "lgtm based on a quick look, but have no way of testing" [puppet] - 10https://gerrit.wikimedia.org/r/685012 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:46:31] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/685012 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:47:44] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) Mailing lists that have a wild ban on all addresses: ` b"bibliowiki: ban_list = ['^.*$', ]" b"board-nominations: ban_list = ['^.*$']" b"discovery-private: ba... [17:48:15] (03CR) 10Bstorm: [C: 03+2] wikireplica-dns: Add the outlier CNAMES and correct fqdn [puppet] - 10https://gerrit.wikimedia.org/r/685012 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:49:04] (03PS5) 10Dzahn: ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) [17:49:27] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:54:53] (03PS1) 10Jbond: P:pki::multirootca: Add timer to clean expired certificates [puppet] - 10https://gerrit.wikimedia.org/r/685026 (https://phabricator.wikimedia.org/T281369) [17:56:19] (03PS2) 10Jbond: P:pki::multirootca: Add timer to clean expired certificates [puppet] - 10https://gerrit.wikimedia.org/r/685026 (https://phabricator.wikimedia.org/T281369) [17:58:35] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10RobH) [17:58:38] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29387/" [puppet] - 10https://gerrit.wikimedia.org/r/685026 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [17:58:55] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.4 (duration: 42m 33s) [17:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1800) [18:26:07] poor wikibugs [18:26:25] I have got used to watching its output vs email [18:32:53] sukhe: I kicked it, should reconnect when it has a message for this channel [18:34:34] Majavah: :) [18:35:31] (03PS1) 10Zabe: Avoid using User::getGroups() and ::getGroupMemberships() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) [18:37:07] (03CR) 10Ssingh: "PCC looks good but I am looking for a review on the approach as well. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:37:24] (03CR) 10Cwhite: [C: 03+2] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1001/29390/" [puppet] - 10https://gerrit.wikimedia.org/r/684969 (https://phabricator.wikimedia.org/T280805) (owner: 10Cwhite) [18:37:45] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) ` hank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper has received the following defective part at our return... [18:48:02] (03PS3) 10Zabe: Avoid using User::getGroups() and ::getEffectiveGroups() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) [18:56:16] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) (owner: 10Zabe) [19:00:04] brennen and liw: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1900). 
[19:17:16] (03PS1) 10Razzi: swap: remove references to profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/685066 (https://phabricator.wikimedia.org/T281917) [19:26:14] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29391/console" [puppet] - 10https://gerrit.wikimedia.org/r/685066 (https://phabricator.wikimedia.org/T281917) (owner: 10Razzi) [19:26:33] (03CR) 10Razzi: [V: 03+1 C: 03+2] swap: remove references to profile::swap [puppet] - 10https://gerrit.wikimedia.org/r/685066 (https://phabricator.wikimedia.org/T281917) (owner: 10Razzi) [19:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10wiki_willy) a:03Jclark-ctr [19:36:31] (03PS1) 10Ahmon Dancy: group0 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685070 [19:36:33] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685070 (owner: 10Ahmon Dancy) [19:37:25] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685070 (owner: 10Ahmon Dancy) [19:38:48] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.4 [19:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:32] (03PS1) 10CDanis: Revert "Ratelimit applebot temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/685073 [19:46:15] !log joal@deploy1002 Started deploy [analytics/refinery@0dc3ae7]: Regular analytics weekly train [analytics/refinery@0dc3ae7] [19:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:31] RECOVERY - Host elastic2033 is UP: PING OK - Packet loss = 0%, RTA = 33.05 ms [19:55:52] 10SRE, 10ops-codfw, 10Discovery, 10Discovery-Search (Current work): 
elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (10Papaul) a:05Papaul→03elukey @elukey the server is back up. All yours [19:55:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:55:57] RECOVERY - Elasticsearch HTTPS for production-search-codfw on elastic2033 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 839 days) https://wikitech.wikimedia.org/wiki/Search [19:56:18] (03CR) 10Hashar: [C: 04-1] "On hold for now cause I really dislike this hack :]" [puppet] - 10https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) (owner: 10Hashar) [19:56:27] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2033 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 839 days) https://wikitech.wikimedia.org/wiki/Search [19:56:59] RECOVERY - SSH on elastic2033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:57:11] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:56] 10SRE, 10ops-codfw, 10Discovery, 10Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (10Papaul) ` pt1979@elastic2033:~$ cat /proc/mdstat Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] md1 : acti... [19:58:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:58:43] (03CR) 10CDanis: [C: 03+2] Revert "Ratelimit applebot temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/685073 (owner: 10CDanis) [20:03:30] !log joal@deploy1002 Finished deploy [analytics/refinery@0dc3ae7]: Regular analytics weekly train [analytics/refinery@0dc3ae7] (duration: 17m 15s) [20:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:41] !log joal@deploy1002 Started deploy [analytics/refinery@0dc3ae7] (thin): Regular analytics weekly train THIN [analytics/refinery@0dc3ae7] [20:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:49] !log joal@deploy1002 Finished deploy [analytics/refinery@0dc3ae7] (thin): Regular analytics weekly train THIN [analytics/refinery@0dc3ae7] (duration: 00m 07s) [20:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:12] !log joal@deploy1002 Started deploy [analytics/refinery@0dc3ae7] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@0dc3ae7] [20:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:59] jouncebot: now [20:05:59] For the next 0 hour(s) and 54 minute(s): MediaWiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T1900) [20:07:09] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [20:08:06] (03PS1) 10Herron: icinga: move hosts icinga[12]001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) [20:08:12] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [20:09:11] (03PS2) 10Herron: icinga: move hosts icinga[12]001 to role::spare::system 
[puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) [20:09:29] !log joal@deploy1002 Finished deploy [analytics/refinery@0dc3ae7] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@0dc3ae7] (duration: 05m 16s) [20:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:48] (03PS3) 10Herron: icinga: move hosts icinga[12]001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) [20:10:29] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [20:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:41] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:53] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (10Papaul) [20:14:14] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10Papaul) [20:14:46] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10serviceops: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (10Papaul) 05Open→03Resolved Complete [20:18:21] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=elasticsearch file=device_smart.prom instance=elastic2033 job=node site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [20:21:26] (03PS1) 10Herron: logstash101[012]: prep for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) [20:32:50] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace 
disk? - https://phabricator.wikimedia.org/T281045 (10Jclark-ctr) @wiki_willy @Bstorm host is out of warranty expired Jun 1, 2020. We do not have any 6tb hard drives how would you like to proce... [20:39:49] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [20:49:05] 10SRE, 10Abstract Wikipedia team, 10DNS, 10Traffic, 10Patch-For-Review: Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [21:11:06] (03Abandoned) 10Cwhite: logstash: disable the dlq [puppet] - 10https://gerrit.wikimedia.org/r/672559 (owner: 10Cwhite) [21:13:49] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@06a4a3e]: Bump glent to 0.2.4 [21:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:44] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@06a4a3e]: Bump glent to 0.2.4 (duration: 03m 55s) [21:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:16] (03PS2) 10Andrew Bogott: wmcs-policy-tests.py: add Trove policy tests [puppet] - 10https://gerrit.wikimedia.org/r/684494 (https://phabricator.wikimedia.org/T279845) [21:23:18] (03PS1) 10Andrew Bogott: trove.conf: use our custom policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/685105 (https://phabricator.wikimedia.org/T281655) [21:25:13] (03CR) 10Dzahn: [C: 03+2] ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:25:19] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-policy-tests.py: add Trove policy tests [puppet] - 10https://gerrit.wikimedia.org/r/684494 (https://phabricator.wikimedia.org/T279845) (owner: 10Andrew Bogott) [21:26:15] andrewbogott: should I 
type the "multiple"? [21:26:29] yes please [21:26:46] ACK, done [21:28:46] (03CR) 10Andrew Bogott: [C: 03+2] trove.conf: use our custom policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/685105 (https://phabricator.wikimedia.org/T281655) (owner: 10Andrew Bogott) [21:32:33] (03PS1) 10Andrew Bogott: mwopenstackclients: add Trove support in python3 [puppet] - 10https://gerrit.wikimedia.org/r/685108 [21:35:35] (03PS1) 10Bstorm: wikireplica-dns: Fix up the outlier dbs [puppet] - 10https://gerrit.wikimedia.org/r/685109 (https://phabricator.wikimedia.org/T260389) [21:40:33] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: add Trove support in python3 [puppet] - 10https://gerrit.wikimedia.org/r/685108 (owner: 10Andrew Bogott) [21:47:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:47:33] (03PS1) 10Dzahn: deployment_server/k8s: fix a syntax issue introduced in rebasing [puppet] - 10https://gerrit.wikimedia.org/r/685116 (https://phabricator.wikimedia.org/T281538) [21:49:00] (03CR) 10Dzahn: [C: 03+2] deployment_server/k8s: fix a syntax issue introduced in rebasing [puppet] - 10https://gerrit.wikimedia.org/r/685116 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:49:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:50:16] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/685116" [puppet] - 10https://gerrit.wikimedia.org/r/681500 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:56:44] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (10Dzahn) Thanks @Papaul. This will continue in T280597. [21:57:11] (03CR) 10BBlack: [C: 03+2] [noop] remove eqiad upload storage override [puppet] - 10https://gerrit.wikimedia.org/r/683025 (owner: 10BBlack) [21:57:43] (03CR) 10Bstorm: [C: 03+2] wikireplica-dns: Fix up the outlier dbs [puppet] - 10https://gerrit.wikimedia.org/r/685109 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [22:01:30] (03PS1) 10Zabe: Add extendedconfirmed on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685119 (https://phabricator.wikimedia.org/T281926) [22:06:47] 10SRE, 10serviceops, 10Patch-For-Review: Put rdb20[09|10] into service - https://phabricator.wikimedia.org/T281225 (10Dzahn) p:05Triage→03Medium [22:08:02] (03PS5) 10BBlack: Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T275046) [22:10:30] (03CR) 10BBlack: [C: 03+2] Puppetize cp501[3456] [puppet] - 10https://gerrit.wikimedia.org/r/683026 (https://phabricator.wikimedia.org/T275046) (owner: 10BBlack) [22:11:07] 10SRE, 10Services, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Dzahn) p:05Triage→03High [22:14:20] 10SRE, 10ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (10Dzahn) p:05Triage→03Medium [22:16:54] 10SRE, 10Traffic, 10Patch-For-Review: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10ops-monitoring-bot) Script 
wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp5013.eqsin.wmnet', 'cp5014.eqsin.wmnet', 'cp5015.eqsin.wmnet', 'cp5016.eqs... [22:17:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:18:36] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Jclark-ctr) nic shows link. card was installed previously unsure if needs to be updated not sure why it says intel? packing slip for nic is Broadcom 57412 2 Port 10Gb SFP... [22:19:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:24:20] (03PS1) 10Dzahn: phabricator: add phab2002 to list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/685130 (https://phabricator.wikimedia.org/T280597) [22:26:48] (03CR) 10Dzahn: [C: 03+2] phabricator: add phab2002 to list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/685130 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:42:13] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [22:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:28] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [22:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:43] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [22:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:07] (03PS1) 10Dzahn: conftool-data: add phab2002 to codfw git-ssh pool [puppet] - 10https://gerrit.wikimedia.org/r/685132 (https://phabricator.wikimedia.org/T280597) 
[22:44:27] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [22:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:46] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [22:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:28] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [22:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:47:43] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [22:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:46] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [22:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:59] (03PS1) 10Dzahn: site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) [22:53:20] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5016 is CRITICAL: connect to address 10.132.0.9 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [22:53:20] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.9: Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [22:53:20] PROBLEM - check_trafficserver_log_fifo_analytics_tls on cp5016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.9: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:53:58] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:56:26] manual-downtimed cp501[56] so they don't spam more alerts as they do initial puppetization [22:58:16] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210504T2300). Please do the needful. [23:00:04] Zabe: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:25] i can deploy today [23:00:27] o/ [23:01:16] (03CR) 10Urbanecm: [C: 03+2] Add extendedconfirmed on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685002 (https://phabricator.wikimedia.org/T281860) (owner: 10Zabe) [23:02:58] (03Merged) 10jenkins-bot: Add extendedconfirmed on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685002 (https://phabricator.wikimedia.org/T281860) (owner: 10Zabe) [23:04:02] Zabe can you test it on mwdebug1001 please? [23:04:55] Urbanecm: works the supposed way [23:05:03] cool, syncing [23:06:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 012d6138741ea76c985453428111aeddfdec2271: Add extendedconfirmed on azwiki (T281860) (duration: 01m 10s) [23:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:05] T281860: Creating "extended confirmed user" right and "extended confirmed protection" system in Azwiki. - https://phabricator.wikimedia.org/T281860 [23:07:12] (03PS2) 10Urbanecm: Add extendedconfirmed on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685119 (https://phabricator.wikimedia.org/T281926) (owner: 10Zabe) [23:07:16] 10SRE, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10Legoktm) >>! In T232343#7058654, @herron wrote: > **Lists:** Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front t... [23:07:56] (03CR) 10Urbanecm: [C: 03+2] Add extendedconfirmed on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685119 (https://phabricator.wikimedia.org/T281926) (owner: 10Zabe) [23:08:24] Zabe: merging the second one, but for future commits, I'd prefer if this was split into two commits: one adding the wiki, second one cleaning up. 
It's simpler to review for me [23:08:59] ok, gonna do that the next time [23:09:25] (03Merged) 10jenkins-bot: Add extendedconfirmed on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685119 (https://phabricator.wikimedia.org/T281926) (owner: 10Zabe) [23:09:39] thanks :) [23:09:54] Zabe: pulled onto mwdebug1001, please test [23:10:05] 10SRE, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10Legoktm) [23:11:05] Urbanecm: works the supposed way [23:12:56] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:13:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e467d92e5e257a3d2f9b05692db9accdd86ddb00: Add extendedconfirmed on ptwiki (T281926) (duration: 01m 10s) [23:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:19] T281926: Set "extended confirmed" protection level/user group for ptwiki - https://phabricator.wikimedia.org/T281926 [23:13:30] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Legoktm) >>! In T280322#7058759, @Ladsgroup wrote: > ` > b"mediation-en-l: ban_list = ['^.*@.*']" > ` I disabled this one with `disable_list` yesterday, I wonder why it doesn't... 
[23:15:03] (03PS4) 10Urbanecm: Avoid using User::getGroups() and ::getEffectiveGroups() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) (owner: 10Zabe) [23:15:05] (03PS5) 10Urbanecm: Avoid using User::getGroups() and ::getEffectiveGroups() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) (owner: 10Zabe) [23:15:20] PS4 are minor formatting changes, PS5 is a rebase [23:15:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:16:19] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) this server isn't in use currently, you're welcome to reboot it or shut it down as part of troubleshooting. [23:18:44] (03CR) 10Urbanecm: [C: 03+2] Avoid using User::getGroups() and ::getEffectiveGroups() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) (owner: 10Zabe) [23:19:17] (03Merged) 10jenkins-bot: Avoid using User::getGroups() and ::getEffectiveGroups() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685032 (https://phabricator.wikimedia.org/T281823) (owner: 10Zabe) [23:19:41] I'm going to just sync this, as it cannot be actually tested [23:21:18] (03PS2) 10Urbanecm: Enable Growth team features in dark mode on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684851 (https://phabricator.wikimedia.org/T280824) [23:21:33] (03CR) 10Urbanecm: [C: 03+2] Enable Growth team features in dark mode on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684851 (https://phabricator.wikimedia.org/T280824) (owner: 10Urbanecm) [23:22:15] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: a3c24f322b754c9a94c260ee5df4b5ae4de27f22: Avoid using User::getGroups() and ::getEffectiveGroups() (T281823) (duration: 01m 
10s) [23:22:21] Zabe: all done :) [23:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:24] T281823: operations/mediawiki-config - hard deprecate User group methods - https://phabricator.wikimedia.org/T281823 [23:22:34] Urbanecm: thanks for your help :) [23:22:41] any time [23:22:54] (03Merged) 10jenkins-bot: Enable Growth team features in dark mode on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684851 (https://phabricator.wikimedia.org/T280824) (owner: 10Urbanecm) [23:24:46] !log Create tables for GrowthExperiments extension on bgwiki (T280824) [23:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:54] T280824: Deploy Growth features on Bulgarian Wikipedia - https://phabricator.wikimedia.org/T280824 [23:25:19] (03PS2) 10Urbanecm: Enable Growth features on enwiki in the dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684988 (https://phabricator.wikimedia.org/T281896) [23:25:23] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on enwiki in the dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684988 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [23:26:31] !log Create tables for GrowthExperiments extension on enwiki (T281896) [23:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:39] T281896: Deploy Growth features on English Wikipedia - https://phabricator.wikimedia.org/T281896 [23:27:21] (03Merged) 10jenkins-bot: Enable Growth features on enwiki in the dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684988 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [23:28:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5b4c516a1d0461065e27cacec5d2b1cb315a2c07: Enable Growth team features in dark mode on bgwiki (T280824; 1/3) (duration: 01m 09s) [23:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:30:17] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 5b4c516a1d0461065e27cacec5d2b1cb315a2c07: Enable Growth team features in dark mode on bgwiki (T280824; 2/3) (duration: 01m 09s) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:26] T280824: Deploy Growth features on Bulgarian Wikipedia - https://phabricator.wikimedia.org/T280824 [23:30:37] !log urbanecm@deploy1002 sync-file aborted: 5b4c516a1d0461065e27cacec5d2b1cb315a2c07: Enable Growth team features in dark mode on bgwiki (T280824; 3/3) (duration: 00m 03s) [23:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:29] (03PS1) 10Dzahn: httpbb: update tests for annual.wikimedia.org to 2020/21 [puppet] - 10https://gerrit.wikimedia.org/r/685139 [23:31:54] !log urbanecm@deploy1002 Synchronized wmf-config/config/bgwiki.yaml: 5b4c516a1d0461065e27cacec5d2b1cb315a2c07: Enable Growth team features in dark mode on bgwiki (T280824; 3/3) (duration: 01m 09s) [23:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:25] (03PS1) 10Urbanecm: Growth features: enwiki: Fix help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685140 (https://phabricator.wikimedia.org/T281896) [23:33:27] (03CR) 10Urbanecm: [C: 03+2] Growth features: enwiki: Fix help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685140 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [23:33:40] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5016 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [23:35:13] (03Merged) 10jenkins-bot: Growth features: enwiki: Fix help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685140 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [23:38:15] (03CR) 10Dzahn: [C: 03+2] httpbb: update tests for annual.wikimedia.org to 2020/21 [puppet] - 
10https://gerrit.wikimedia.org/r/685139 (owner: 10Dzahn) [23:38:36] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [23:38:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d29dbb2f435afe64f2fee15b430ee04d5d13c8d7: Enable Growth features on enwiki in the dark mode (T281896; 1/3) (duration: 01m 09s) [23:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:50] T281896: Deploy Growth features on English Wikipedia - https://phabricator.wikimedia.org/T281896 [23:39:16] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:30] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:40:16] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: d29dbb2f435afe64f2fee15b430ee04d5d13c8d7: Enable Growth features on enwiki in the dark mode (T281896; 2/3) (duration: 01m 09s) [23:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:18] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5015.eqsin.wmnet', 'cp5013.eqsin.wmnet', 'cp5014.eqsin.wmnet', 'cp5016.eqsin.wmnet'] ` and were **ALL** successful. 
[23:41:37] !log urbanecm@deploy1002 Synchronized wmf-config/config/enwiki.yaml: d29dbb2f435afe64f2fee15b430ee04d5d13c8d7: Enable Growth features on enwiki in the dark mode (T281896; 3/3) (duration: 01m 09s) [23:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:56] * Urbanecm done [23:45:09] I missed out [23:45:20] still have a config change in the queue [23:48:01] mutante: do you want me to deploy it? [23:51:12] Urbanecm: hmm.. yes :) [23:51:29] sure :). Can you link it please? [23:51:36] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681766 [23:51:43] it's the fc-list [23:51:57] it's been outdated for 3.5 years [23:53:55] https://en.wikipedia.org/wiki/Fontconfig#Utilities [23:54:02] (03CR) 10Urbanecm: [C: 03+2] update fc-list to current version on buster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [23:54:22] mutante: thanks for fixing a RT-era ticket it seems :) [23:54:47] Urbanecm: :) there is more to it in the pipeline.. like a timer to keep it current [23:54:52] but for now this [23:55:01] thank you too [23:55:13] no problem :) [23:55:45] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 import script is unnecessarily truncating list descriptions - https://phabricator.wikimedia.org/T281933 (10Ladsgroup) One thing I was thinking was that we can simply set the description from the old file (similar to how we handle templates), post-creation if that's ea... [23:55:47] "for the record, I replaced this list with one generated on thumbor1002 instead of mw2300 but the fc-list content stayed the same" [23:56:10] ^ it matters on thumbor servers because SVG [23:57:06] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) Yeah, it uses a week-old config file. I can update it. For now, I want ignore any mailing list in that list. 
[23:57:26] 10SRE, 10Mail, 10Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (10Legoktm) The emails are coming from Mailman2... ` The HelpDesk-l@lists.wikimedia.org mailing list has 4 request(s) waiting for your consideration at: https://lis... [23:58:03] (03PS1) 10Urbanecm: Growth: enwiki: Add list of mentors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685143 (https://phabricator.wikimedia.org/T281896) [23:58:17] (03Merged) 10jenkins-bot: update fc-list to current version on buster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [23:59:15] 10SRE, 10Patch-For-Review: update svg font list - https://phabricator.wikimedia.org/T79424 (10Dzahn) @kaldari @JoKalliauer An updated fc-list (as it looks on a thumbor server nowadays) has been deployed. T280718#7034890 has more for later to keep it from getting outdated that much again [23:59:41] mutante: actually I'm just running the sync. [23:59:48] should be live soon (within a minute) through