[00:00:03] fyi: UTC late backport window is cancelled [00:00:24] guess I'll bounce it [00:01:50] oh wait I'm sshed into it just fine?? not sure what's up, poking around a little [00:02:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:19] there should be some logs on centrallog1001 now I believe [00:05:43] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [00:07:29] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:08:23] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:45] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: product-analytics-movement-metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:27] !log rzl@graphite1004:~$ sudo shutdown -r now T297265 [00:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:33] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [00:15:45] PROBLEM - graphite eqiad port 443/tcp - Graphite metrics platform IPv4 on graphite1004 is CRITICAL: connect to address 10.64.16.149 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Graphite [00:16:17] PROBLEM - graphite.wikimedia.org requires authentication on graphite1004 is CRITICAL: connect to address 10.64.16.149 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [00:17:01] !log 
deployed updated patches for T297322 [00:17:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:21] PROBLEM - SSH on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:21:08] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:37] paged for graphite, I was hoping to avert that :/ [00:21:49] acked in VO [00:21:51] apparently when I rebooted it, it went down but didn't come back [00:21:52] thanks [00:22:00] try to bring it up via mgmt? [00:22:00] will try again via mgmt [00:22:03] :) [00:23:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:07] !log graphite1004.mgmt: /admin1-> racadm serveraction powercycle (T297265) [00:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:12] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [00:27:38] hey, in a parking lot at the moment but we should have netconsole logs now in syslog on centrallog1001 for graphite1004 [00:28:08] RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [00:28:23] PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:28:25] PROBLEM - carbon-frontend-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:28:25] RECOVERY - SSH on graphite1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:28:27] PROBLEM - carbon-local-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:28:43] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [00:29:15] RECOVERY - graphite eqiad port 443/tcp - Graphite metrics platform IPv4 on graphite1004 is OK: OK - Certificate graphite.discovery.wmnet will expire on Mon 10 Feb 2025 02:49:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Graphite [00:29:49] RECOVERY - graphite.wikimedia.org requires authentication on graphite1004 is OK: OK - Certificate graphite.discovery.wmnet will expire on Mon 10 Feb 2025 02:49:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [00:29:59] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:29:59] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [00:30:05] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 92.64% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:30:31] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:39] RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[00:30:39] RECOVERY - carbon-frontend-relay service on graphite1004 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:30:41] RECOVERY - carbon-local-relay service on graphite1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:40:01] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10colewhite) ` $ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jun-01-2018 | 10:56:02 | SEL... [00:46:27] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:57:39] PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T0100). [01:09:55] RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:17:20] 10SRE, 10Scap, 10Release-Engineering-Team (Radar): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10Legoktm) 05Open→03Resolved I've scapped once or twice recently and didn't notice it being egregiously slow. We can call... 
[01:18:55] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:19:06] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10Legoktm) >>! In T297265#7558112, @colewhite wrote: > Did we replace the memory after the events on Sept 26th? I cannot find any Phabricator tickets mention... [01:21:59] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcephmon1003, cloudcephmon1002, cp5006, cloudcephmon1001, gitlab2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [02:10:35] (03PS1) 10Cwhite: discovery: move read traffic to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745352 (https://phabricator.wikimedia.org/T297265) [02:12:38] (03PS1) 10Cwhite: wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745353 [02:13:14] (03PS2) 10Cwhite: wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745353 (https://phabricator.wikimedia.org/T297265) [02:15:37] (03PS1) 10Cwhite: profile: move statsd writes to graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/745354 (https://phabricator.wikimedia.org/T297265) [02:18:44] (03PS1) 10Cwhite: ProductionServices: use graphite2003 for statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745355 (https://phabricator.wikimedia.org/T297265) [02:28:17] (03PS1) 10Cwhite: graphite: check graphite2003 metrics [puppet] - 10https://gerrit.wikimedia.org/r/745356 (https://phabricator.wikimedia.org/T297265) [02:34:38] (03CR) 10Cwhite: [C: 03+2] discovery: move read traffic to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745352 
(https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:36:12] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Legoktm) I don't think so, I just wanted someone else to +1 the request and I can grant the permissions myself. [02:39:21] (03CR) 10Cwhite: [C: 03+2] graphite: check graphite2003 metrics [puppet] - 10https://gerrit.wikimedia.org/r/745356 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:39:29] (03PS2) 10Cwhite: graphite: check graphite2003 metrics [puppet] - 10https://gerrit.wikimedia.org/r/745356 (https://phabricator.wikimedia.org/T297265) [02:40:14] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10herron) It would be good to rule out the memory, although the timestamps in the SEL and hangs don't line up closely. FWIW the host was reimaged on 10/23 to... [02:52:00] (03CR) 10Herron: [C: 03+1] profile: move statsd writes to graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/745354 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:52:29] (03CR) 10Herron: [C: 03+1] wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745353 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:53:17] (03CR) 10Cwhite: [C: 03+2] wmnet: move writes to graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/745353 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:54:18] !log failover statsd ingest host to graphite2003 T297265 [02:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:24] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [02:54:50] (03CR) 10Cwhite: [C: 03+2] profile: move statsd writes to graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/745354 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [02:59:44] (03CR) 10Herron: [C: 
03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745355 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [03:17:55] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:21:09] (03CR) 10Cwhite: [C: 03+2] ProductionServices: use graphite2003 for statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745355 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [03:21:51] (03Merged) 10jenkins-bot: ProductionServices: use graphite2003 for statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745355 (https://phabricator.wikimedia.org/T297265) (owner: 10Cwhite) [03:27:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [03:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[03:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:59] !log cwhite@deploy1002 Synchronized wmf-config/ProductionServices.php: fail over statsd to graphite2003 T297265 (duration: 01m 05s) [03:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:04] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [03:34:47] !log bounce navtiming on webperf1001 to pick up statsd changes T297265 [03:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:10] !log bounce superset on an-tool1010 and 1005 to pick up statsd changes T247963 [03:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:14] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 [04:03:44] (03PS1) 10Herron: icinga: make graphite2003 paging [puppet] - 10https://gerrit.wikimedia.org/r/745359 (https://phabricator.wikimedia.org/T297265) [04:04:23] (03CR) 10Cwhite: [C: 03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/745359 (https://phabricator.wikimedia.org/T297265) (owner: 10Herron) [04:04:41] (03CR) 10Herron: [C: 03+2] icinga: make graphite2003 paging [puppet] - 10https://gerrit.wikimedia.org/r/745359 (https://phabricator.wikimedia.org/T297265) (owner: 10Herron) [04:09:36] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10colewhite) We are failed over to graphite2003 for now. 
[04:09:54] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10colewhite) p:05Unbreak!→03High [04:18:01] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:34:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:20:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 72 probes of 639 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:25:39] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:26:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 639 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:37:06] 10SRE, 10Infrastructure-Foundations, 10Mail: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Aklapper) [07:39:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 99 probes of 631 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas 
[07:43:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 97 probes of 639 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:46:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 56 probes of 631 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:49:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 639 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:03:53] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:06:07] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:12:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [08:13:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [08:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS buster [08:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:22] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS buster [08:31:36] (03PS1) 10Jcrespo: dbbackups: Add x1 to db1116 backup source host [puppet] - 10https://gerrit.wikimedia.org/r/745464 (https://phabricator.wikimedia.org/T296546) [08:32:23] PROBLEM - Disk space on ml-etcd2002 is CRITICAL: DISK CRITICAL - free space: / 702 MB (3% inode=95%): /tmp 702 MB (3% inode=95%): /var/tmp 702 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops [08:32:41] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add x1 to db1116 backup source host [puppet] - 10https://gerrit.wikimedia.org/r/745464 (https://phabricator.wikimedia.org/T296546) (owner: 10Jcrespo) [08:36:32] checking ml-etcd [08:37:25] ok so ml-etcd2003 is down due to DRBD changes https://sal.toolforge.org/production?p=0&q=ml-etcd&d= [08:37:33] 2002 is 
probably filling up with logs [08:39:11] elukey: I thought I had powered that back on? let me check [08:40:31] moritzm: it seems down from icinga's perspective :( [08:40:51] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [08:41:02] fixed [08:42:10] I couldn't actually switch it to DRBD since the row A cluster in codfw was too full to get another free secondary instance, and it seems I forgot to power it back on after debugging [08:42:50] but the state of those was weird to begin with: only ml-etcd2001 and 2003 were using plain disk storage, 2002 was already using DRBD [08:43:32] elukey: as such I'm inclined to assume there's no real issue with running those in the default DRBD setting and we should probably keep them that way when the ganeti update is over? [08:44:08] fewer special cases, actual redundancy [08:44:19] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) Thank you folks for taking care of this! >>! In T297265#7556849, @Legoktm wrote: > @fgiunchedi two other things that it would be good to have y... [08:44:27] moritzm: we can definitely try, it will be clear from latencies if this is not the way to go. Would it be a problem, if needed, to switch it back to plain if we notice some latency regression in the future? 
[08:45:07] we are not really using the codfw cluster at the moment, I suspect that the real issues may come up when we'll route real user traffic to it [08:45:35] (03PS1) 10Jcrespo: dbbackups: Add s2 to db1139 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/745466 (https://phabricator.wikimedia.org/T296546) [08:46:18] ack, sounds good [08:46:39] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) The temporary netconsole client on graphite1004 paid off, see https://phabricator.wikimedia.org/P18076 for logs from the host (`journalctl -u n... [08:50:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32914/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [08:52:30] RECOVERY - Disk space on ml-etcd2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops [08:53:40] (03PS1) 10Jcrespo: dbbackups: Switchover db backup generation away from db1102 [puppet] - 10https://gerrit.wikimedia.org/r/745469 (https://phabricator.wikimedia.org/T296546) [08:54:18] (03PS2) 10Jcrespo: dbbackups: Switchover db backup generation away from db1102 [puppet] - 10https://gerrit.wikimedia.org/r/745469 (https://phabricator.wikimedia.org/T296546) [08:57:16] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/745284 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [08:57:47] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switchover db backup generation away from db1102 [puppet] - 10https://gerrit.wikimedia.org/r/745469 (https://phabricator.wikimedia.org/T296546) (owner: 10Jcrespo) [09:07:27] (03CR) 
10Jcrespo: [C: 03+2] dbbackups: Add s2 to db1139 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/745466 (https://phabricator.wikimedia.org/T296546) (owner: 10Jcrespo) [09:10:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS buster [09:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:36] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS buster completed: - ganeti2020 (**PASS**) - Downtimed on Icinga... [09:12:50] (03CR) 10Ema: [C: 03+1] "One nit but not a blocker!" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [09:14:28] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add exim4 blackhole configuration [puppet] - 10https://gerrit.wikimedia.org/r/743207 (https://phabricator.wikimedia.org/T296373) (owner: 10Filippo Giunchedi) [09:14:40] (03PS1) 10Kosta Harlan: CacheDecorator: Bump cache version [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745371 (https://phabricator.wikimedia.org/T297248) [09:17:29] (03CR) 10Ayounsi: netconsole: refactor targets lookup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [09:20:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [09:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [09:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:55] (03PS10) 10Vgutierrez: envoyproxy: Support PreserveCase 
HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [09:25:01] (03CR) 10Hashar: [C: 03+1] "The code comes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/154488 in 2014. It refers to integration.wmflabs.org which was a " [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [09:28:53] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (https://phabricator.wikimedia.org/T243065) (owner: 10Filippo Giunchedi) [09:29:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32915/console" [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:32:21] (03PS11) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [09:32:48] (03PS5) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [09:33:12] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [09:34:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:54] (03PS1) 10Arturo Borrero Gonzalez: hiera: ceph: add mgr keyrings placeholders [labs/private] - 10https://gerrit.wikimedia.org/r/745477 (https://phabricator.wikimedia.org/T293752) [09:35:12] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: ceph: add mgr keyrings placeholders [labs/private] - 10https://gerrit.wikimedia.org/r/745477 (https://phabricator.wikimedia.org/T293752) (owner: 
10Arturo Borrero Gonzalez) [09:43:54] (03PS1) 10Arturo Borrero Gonzalez: hiera: ceph: add dummy caps for mgr auth entries [labs/private] - 10https://gerrit.wikimedia.org/r/745478 (https://phabricator.wikimedia.org/T293752) [09:44:06] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: ceph: add dummy caps for mgr auth entries [labs/private] - 10https://gerrit.wikimedia.org/r/745478 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:44:09] (03CR) 10Hashar: "I have added profile::beta::motd to classes fairly recently via https://gerrit.wikimedia.org/r/c/operations/puppet/+/699207 and I am pret" [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [09:44:19] (03CR) 10David Caro: "Can you run some pcc checks on those? (and maybe one osd, and one cloudvirt just in case)." [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:47:46] (03CR) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:48:36] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:04] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Volans) >>! In T296906#7555446, @Andrew wrote: > I don't much care about having to click through the partman step but imaging still fails. Now it stalls on > > 'Atte... 
[09:50:14] (03PS1) 10Jcrespo: dbbackups: Add s3 to db1145 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/745480 (https://phabricator.wikimedia.org/T296546) [09:51:55] (03PS4) 10Filippo Giunchedi: netconsole: move to anycast syslog [puppet] - 10https://gerrit.wikimedia.org/r/745204 [09:51:57] (03PS4) 10Filippo Giunchedi: graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) [09:52:47] (03PS2) 10Btullis: Increase the timeout for Druid on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) [09:53:22] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32918/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [09:54:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/745196 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm) [09:55:01] (03CR) 10David Caro: ceph: mgr: migrate keyring to new auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:57:45] (03CR) 10Arturo Borrero Gonzalez: "diff-heavy PCC https://puppet-compiler.wmflabs.org/compiler1002/32917/" [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:57:51] (03CR) 10Ema: [C: 03+1] netconsole: move to anycast syslog [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:00:35] !log depool durum2002 [10:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:12] (03PS2) 10Muehlenhoff: Make ganeti2027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/745198 
(https://phabricator.wikimedia.org/T294139) [10:02:38] !log pool durum2002 [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:31] RECOVERY - Check systemd state on durum2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:47] depooling and pooling durum makes me crave Turkish wraps... [10:05:03] +1 [10:05:13] (03PS4) 10Jbond: cfssl::config: support per profile auth keys [puppet] - 10https://gerrit.wikimedia.org/r/737036 [10:07:18] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) I looked at the stack trace and to me it looks like either a kernel bug (we've never run graphite with `5.10.0-8-amd64` as per [[ https://thanos... [10:08:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32920/console" [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [10:09:03] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:09:08] (03CR) 10Ayounsi: [C: 03+1] "One nit, also probably worth mentioning we use that IP for that usecase in https://wikitech.wikimedia.org/wiki/Anycast_syslog" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:09:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::config: support per profile auth keys [puppet] - 10https://gerrit.wikimedia.org/r/737036 (owner: 10Jbond) [10:09:44] jbond: you got a puppet-merge open :) [10:09:47] vgutierrez: happy for me to merge [10:09:52] :) [10:09:59] so go ahead and merge [10:10:10] it's locked by your puppet-merge instance [10:10:28] yes, see above, we both submitted at the same time :). 
should be merged now [10:10:32] (03PS1) 10Elukey: profile::prometheus::alerts: fix eventgate's dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/745484 [10:10:34] thx [10:11:56] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti2027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/745198 (https://phabricator.wikimedia.org/T294139) (owner: 10Muehlenhoff) [10:12:07] (03CR) 10Btullis: "As per @elukey's recommendation and discussion here:" [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: 10Btullis) [10:12:39] (03PS5) 10Filippo Giunchedi: netconsole: move to anycast syslog [puppet] - 10https://gerrit.wikimedia.org/r/745204 [10:12:41] (03PS5) 10Filippo Giunchedi: graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) [10:13:31] (03PS1) 10Lucas Werkmeister (WMDE): Fix LexemeHeader and GlossWidget mounting [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745372 (https://phabricator.wikimedia.org/T297328) [10:13:33] (03CR) 10Btullis: [C: 03+1] "Ah, many thanks." [puppet] - 10https://gerrit.wikimedia.org/r/745484 (owner: 10Elukey) [10:14:05] (03CR) 10Volans: [C: 04-1] sre.ganeti.addnode: Pass the Ganeti group to gnt-node add (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff) [10:14:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32921/console" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:15:10] (03CR) 10Ayounsi: [C: 03+1] "Nice, ship it! 
(after PCC)" [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:15:47] (03PS2) 10Jcrespo: dbbackups: Add s3 to db1145 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/745480 (https://phabricator.wikimedia.org/T296546) [10:17:39] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add s3 to db1145 (backup source) [puppet] - 10https://gerrit.wikimedia.org/r/745480 (https://phabricator.wikimedia.org/T296546) (owner: 10Jcrespo) [10:18:00] (03CR) 10Filippo Giunchedi: [V: 03+1] netconsole: move to anycast syslog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:19:33] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] netconsole: move to anycast syslog [puppet] - 10https://gerrit.wikimedia.org/r/745204 (owner: 10Filippo Giunchedi) [10:20:34] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: enable netconsole client [puppet] - 10https://gerrit.wikimedia.org/r/745205 (https://phabricator.wikimedia.org/T297265) (owner: 10Filippo Giunchedi) [10:21:34] (03CR) 10Elukey: [C: 03+2] profile::prometheus::alerts: fix eventgate's dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/745484 (owner: 10Elukey) [10:23:56] (03CR) 10Elukey: [C: 03+1] Remove duplicate cluster variable from Druid check [alerts] - 10https://gerrit.wikimedia.org/r/744813 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [10:24:25] (03PS6) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [10:25:20] (03CR) 10Btullis: [C: 03+2] Remove duplicate cluster variable from Druid check [alerts] - 10https://gerrit.wikimedia.org/r/744813 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [10:25:59] (03CR) 10Btullis: [C: 03+2] Increase the timeout for Druid on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: 10Btullis) [10:26:23] (03CR) 
10Btullis: [C: 03+2] Increase the timeout for Druid on the analytics cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: 10Btullis) [10:27:26] (03Merged) 10jenkins-bot: Remove duplicate cluster variable from Druid check [alerts] - 10https://gerrit.wikimedia.org/r/744813 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [10:30:02] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:35:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764 (owner: 10Giuseppe Lavagetto) [10:35:36] (03PS5) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) [10:37:08] (03PS1) 10Filippo Giunchedi: hieradata: remove netconsole server from esams [puppet] - 10https://gerrit.wikimedia.org/r/745485 [10:37:10] (03PS1) 10Filippo Giunchedi: role: remove netconsole from ganeti [puppet] - 10https://gerrit.wikimedia.org/r/745486 [10:38:46] (03Merged) 10jenkins-bot: mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764 (owner: 10Giuseppe Lavagetto) [10:39:13] (03CR) 10Michael Große: [C: 03+1] Fix LexemeHeader and GlossWidget mounting [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745372 (https://phabricator.wikimedia.org/T297328) (owner: 10Lucas Werkmeister (WMDE)) [10:41:16] Lucas_WMDE: hi, just making sure you saw my question on https://phabricator.wikimedia.org/T294355 ? 
[10:41:32] yes, I still need to reply to that, sorry [10:42:03] jouncebot: nowandnext [10:42:03] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [10:42:03] In 0 hour(s) and 17 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1100) [10:43:02] Lucas_WMDE: ok! sure no problem [10:43:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "deploying" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745372 (https://phabricator.wikimedia.org/T297328) (owner: 10Lucas Werkmeister (WMDE)) [10:43:34] ^ I’ll deploy this, let’s hope gate-and-submit finishes before the Citoid/Zotero window starts [10:45:50] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [10:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:21] (03PS1) 10Jbond: Release 2.0.0: Update files in preperation for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/745487 [10:47:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [10:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:09] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:38] godog: does Grafana have an easy way to get all the topics used by a dashboard, by any chance? 
[10:49:56] (03CR) 10David Caro: [C: 03+1] Release 2.0.0: Update files in preperation for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/745487 (owner: 10Jbond) [10:51:00] Lucas_WMDE: I don't know specifically, but you can get the JSON for the whole dashboard if you go to the "share" icon, select "export" and then "View JSON" [10:51:10] (03PS1) 10Thiemo Kreuz (WMDE): Fix special page displaying unescaped user input [extensions/FileImporter] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/745373 (https://phabricator.wikimedia.org/T296605) [10:51:17] so with a bit of effort you could maybe parse that. [10:51:19] hm, I’ll try that with the next dashboard [10:51:21] thanks [10:51:35] * Lucas_WMDE is currently clicking “edit” on a bunch of panels ^^ [10:52:25] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:01] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [10:54:22] Lucas_WMDE: topics as in kafka topics? which dashboards? [10:54:48] Graphite things, not sure if “topics” is the right word (“metrics”?) e.g. https://grafana.wikimedia.org/d/000000167/wikidata-datamodel?orgId=1&refresh=30m [10:54:51] XioNoX topranks ^ re: pingoffloadmissingip [10:55:18] godog: thanks, looking [10:56:06] Lucas_WMDE: yeah that'd be graphite metrics, what topranks said essentially, download the dashboard json and look for 'targets' with the graphite query [10:56:20] beats checking each panel individually [10:56:32] yeah, piping into jq -r '.. | .target? | select(.)' seems to work well [10:56:40] and then I can copy+paste the metrics out of the expressions with double-click [10:56:44] thanks! [10:56:57] sure np! 
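[note] The jq one-liner above (`jq -r '.. | .target? | select(.)'`) walks the exported dashboard JSON and prints every non-null `target` field. Below is a rough Python sketch of the same idea, for anyone without jq at hand. The `SAMPLE_DASHBOARD` structure and its metric names are made up for illustration and are not taken from any real dashboard; the only assumption about the real export is that Graphite queries sit in fields named `target`, as described in the conversation.

```python
import json


def iter_targets(node):
    """Yield every "target" value found anywhere in a parsed JSON tree.

    Mirrors jq's `.. | .target? | select(.)`: values that are null (None)
    or false are skipped, everything else is yielded in document order.
    """
    if isinstance(node, dict):
        value = node.get("target")
        if value is not None and value is not False:
            yield value
        for child in node.values():
            yield from iter_targets(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_targets(child)


# Hypothetical dashboard export, invented for this example only.
SAMPLE_DASHBOARD = json.loads("""
{
  "title": "example dashboard",
  "panels": [
    {"title": "panel one",
     "targets": [{"refId": "A", "target": "example.metric.one"},
                 {"refId": "B", "target": null}]},
    {"title": "panel two",
     "targets": [{"refId": "A", "target": "sumSeries(example.metric.*)"}]}
  ]
}
""")

if __name__ == "__main__":
    for target in iter_targets(SAMPLE_DASHBOARD):
        print(target)
```

To try it on a real export, replace `SAMPLE_DASHBOARD` with `json.load(open(path))` for the file saved via "share" > "export" > "View JSON". Because the walk is recursive, it should not matter whether panels are nested under rows or listed at the top level.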
[10:57:47] topranks: runbook on https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert fyi [10:58:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [10:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:06] but also looks like it recovered? [10:58:16] PROBLEM - Host ganeti2027 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:28] PROBLEM - ores on ores2001 is CRITICAL: connect to address 10.192.0.12 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [10:58:31] moritzm: ^ [10:58:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [10:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:40] still shows when I refresh the alerts page [10:59:17] topranks: the alert checks this graph: https://grafana.wikimedia.org/d/000000513/ping-offload?viewPanel=11&orgId=1&refresh=30s [10:59:42] (03PS1) 10Filippo Giunchedi: pontoon: default to email blackhole [puppet] - 10https://gerrit.wikimedia.org/r/745489 (https://phabricator.wikimedia.org/T296373) [10:59:49] ok yeah thanks, seems to have cleared, alert page probably lagging behind [11:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1100). [11:00:25] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:00:26] XioNoX: that is surprising, I'm logged into the server via SSH :-) [11:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:40] is anybody working on ores2001? 
[11:00:45] otherwise I'll check what's happening [11:01:22] alert1001:~$ ping ganeti2027 -4 doesn't work [11:01:26] v6 works [11:01:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [11:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] moritzm: could it somehow have lost its v4 IP? [11:02:07] XioNoX: nvm, found the error, should be rectified with next reboot [11:02:16] cool! [11:02:31] (03PS2) 10Jbond: Release 2.0.0: Update files in preparation for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/745487 (https://phabricator.wikimedia.org/T297356) [11:02:40] (03Merged) 10jenkins-bot: Fix LexemeHeader and GlossWidget mounting [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745372 (https://phabricator.wikimedia.org/T297328) (owner: 10Lucas Werkmeister (WMDE)) [11:02:44] (03CR) 10Jbond: [C: 03+2] Release 2.0.0: Update files in preparation for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/745487 (https://phabricator.wikimedia.org/T297356) (owner: 10Jbond) [11:02:48] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:02:57] also FWIW the alert is defined in alerts.git it isn't based on grafana [11:03:07] looks like mvolz isn’t around, I assume it’s okay for me to deploy the WikibaseLexeme backport [11:03:22] PROBLEM - Host ores2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:25] godog: I mean it tracks the same values as what's graphed? [11:04:08] Either way it's odd, ping2002 has all the IPs the CRs in codfw are configured to redirect configured on it. 
[11:04:34] testing my backport on mwdebug1001 [11:04:37] XioNoX: ah, kinda yeah, in alerts.git there's no rate() but it is otherwise the same metric underneath yeah [11:04:37] (03PS5) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) [11:04:46] (03Merged) 10jenkins-bot: Release 2.0.0: Update files in preparation for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/745487 (https://phabricator.wikimedia.org/T297356) (owner: 10Jbond) [11:04:51] topranks: maybe something did a broadcast ping for something different? [11:04:52] RECOVERY - Host ganeti2027 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [11:05:40] works, I’ll sync [11:06:02] moritzm: ganeti2027 is in the same rack (A4) as ores2001, that is up but seems without connectivity - anything happening to that rack? [11:06:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [11:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:37] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[11:07:38] weird [11:07:38] elukey@asw-a-codfw> show interfaces descriptions | match ores2001 [11:07:38] ge-4/0/17 up up ores2001 [11:07:39] let me check the host [11:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikibaseLexeme/resources/widgets/: Backport: [[gerrit:745372|Fix LexemeHeader and GlossWidget mounting (T297328)]] (duration: 01m 06s) [11:07:50] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] T297328: Glosses get double margin-right because of duplicated div - https://phabricator.wikimedia.org/T297328 [11:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:47] elukey: switch logs are fine [11:08:51] elukey: what's up with ores? [11:09:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:39] I'm going to attempt a small beta cluster deployment, to enable versioned maps: T294339 [11:09:40] T294339: Deploy versioned maps to the beta cluster - https://phabricator.wikimedia.org/T294339 [11:09:50] XioNoX: it lost its connectivity, I can't even ping other nodes from it (I am in the mgmt console) [11:09:52] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:09:57] XioNoX: I tested pinging broadcast IP on that subnet, 10.192.3.255, from cr1-codfw. Could see requests hitting ping2002, but it's not shown up as InAddrErrors on the graph. 
(03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [11:10:02] Good thinking though! [11:10:12] I'll move on I think, if it happens again we can dig deeper [11:10:22] elukey: it looks healthy? [11:10:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:25] (03CR) 10Awight: [C: 03+2] "Merging beta cluster config for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745490 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [11:10:46] elukey: could it have core-dumped? [11:10:54] er, kernel panicked [11:10:54] XioNoX: from what view? I tried to restart networking (via sysctl) and the interface didn't come up :D [11:11:01] (03PS2) 10Ladsgroup: deployment-prep: Remove motd class from hiera [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) [11:11:05] (03CR) 10Ladsgroup: deployment-prep: Remove motd class from hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [11:11:10] (03Merged) 10jenkins-bot: beta: enable Kartographer versioned maps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745490 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [11:11:17] anyway, I'll try to reboot [11:11:19] elukey: anything suspicious in the logs? [11:11:41] do I need to do anything special to update ResourceLoader to deploy a JS change? I still seem to be getting the old JavaScript when the WikimediaDebug extension isn’t enabled [11:11:49] (and I think I synced the right directory…) [11:12:00] XioNoX: nothing in dmesg, ethtool was ok, but pings didn't work.. 
trying to reboot [11:12:40] (03PS6) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [11:12:59] elukey: ok, thx! [11:13:20] PROBLEM - Host ms-be1028 is DOWN: PING CRITICAL - Packet loss = 100% [11:13:27] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] beta: enable Kartographer versioned maps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745490 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [11:13:52] ok, looks like I just needed to wait five minutes :) [11:13:54] !log reboot ores2001 (lost connectivity, we suspect some weird problem with the NIC, but no traces in the kernel logs) [11:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:57] (03PS1) 10JMeybohm: Add auth_key for k8s_staging PKI profile [labs/private] - 10https://gerrit.wikimedia.org/r/745494 [11:13:58] Lucas_WMDE: take your time, just o/ when I can rebase a small beta config. 
[11:14:08] RECOVERY - Host ores2001 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [11:14:25] I’m done, feel free to deploy awight (unless XioNoX / elukey object, but I think their work seems unrelated) [11:14:37] Lucas_WMDE: yeah, proceed [11:14:40] this is unrelated [11:14:41] good for me [11:14:43] yes yes +1 [11:15:02] ores2001 should recover soon [11:15:38] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 3.803 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [11:15:50] \o/ [11:16:01] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:16:04] RECOVERY - Host ms-be1028 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [11:17:19] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:18:07] (03PS1) 10JMeybohm: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 [11:18:52] Lucas_WMDE: ty! [11:19:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:18] !log Changing export policy applied on eqiad CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs (T295672) [11:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:23] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [11:19:46] hey Lucas_WMDE, a question about the deploys coming up in the backport window: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/744066 this is three separate files being touched, what's the worst that can happen here? [11:20:21] hm, we’d need to figure out in which order to sync them [11:20:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] I assume the .yaml file isn’t used at runtime, only to build dblists/ [11:20:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [11:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:50] but esanders is probably the best person to decide whether dblist or IS should be synced first [11:20:59] (03PS2) 10JMeybohm: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 [11:21:31] probably IS.php first? disable for anons while it’s still disabled for everyone, then enable via dblist while leaving disabled for anons? [11:21:42] I can never remember whether a rebase is enough, or if beta config also needs to be scap sync-file'd to production... 
[11:21:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/745486 (owner: 10Filippo Giunchedi) [11:22:03] I usually sync it anyways but I don’t think it’s required [11:22:16] argh that is the trap I was leaning into [11:22:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/745485 (owner: 10Filippo Giunchedi) [11:22:55] (03PS1) 10David Caro: Review access change [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745377 [11:23:00] a rebase is required, a sync is not [11:23:37] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:23:38] majavah: thanks! I'll add this to the instructions one day... [11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] (03PS2) 10Ladsgroup: xdummy: Remove xdummy [puppet] - 10https://gerrit.wikimedia.org/r/745287 (https://phabricator.wikimedia.org/T133183) [11:24:03] Lucas_WMDE: ok, I guess we'll see when he shows up. No trainees for the window today btw. [11:24:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] xdummy: Remove xdummy [puppet] - 10https://gerrit.wikimedia.org/r/745287 (https://phabricator.wikimedia.org/T133183) (owner: 10Ladsgroup) [11:24:31] Lucas_WMDE: I'm done, in that case :-D [11:24:37] 2× ok :) [11:24:56] Efficient! [11:26:34] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10MoritzMuehlenhoff) 05Resolved→03Open I can't connect to the serial console of ganeti2027 with our management password, but ganeti2028 works. Does 2027 maybe still us... 
[11:26:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) [11:27:29] (03PS1) 10Filippo Giunchedi: prometheus: use ip_protocol_fallback only on newer blackbox-exporter versions [puppet] - 10https://gerrit.wikimedia.org/r/745498 (https://phabricator.wikimedia.org/T291946) [11:29:40] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32923/console" [puppet] - 10https://gerrit.wikimedia.org/r/745498 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:30:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:22] (03PS7) 10JMeybohm: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) [11:31:09] (03PS2) 10Filippo Giunchedi: prometheus: use ip_protocol_fallback only on newer blackbox-exporter versions [puppet] - 10https://gerrit.wikimedia.org/r/745498 (https://phabricator.wikimedia.org/T291946) [11:31:11] (03CR) 10JMeybohm: calico: Allow to configure the IPAM module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [11:33:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32924/console" [puppet] - 10https://gerrit.wikimedia.org/r/745498 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:35:06] (03PS3) 10JMeybohm: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 [11:35:53] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: use ip_protocol_fallback only on newer blackbox-exporter versions 
[puppet] - 10https://gerrit.wikimedia.org/r/745498 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:38:22] !log added ganeti2027 to ganeti codfw cluster T294139 [11:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:27] T294139: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 [11:39:57] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add auth_key for k8s_staging PKI profile [labs/private] - 10https://gerrit.wikimedia.org/r/745494 (owner: 10JMeybohm) [11:41:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10ArielGlenn) [11:41:48] (03CR) 10Jelto: [C: 03+1] calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [11:42:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10ArielGlenn) I've marked the dumps related modules as keep; we will want to get rid of them in awhile but we cannot do so yet. 
[11:44:16] !log Re-enabling multihop BGP session from cr1-eqiad to cr2-eqord (T295672) [11:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:21] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [11:45:30] (03PS1) 10JMeybohm: Fix auth_key for k8s_staging PKI profile [labs/private] - 10https://gerrit.wikimedia.org/r/745499 [11:45:51] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix auth_key for k8s_staging PKI profile [labs/private] - 10https://gerrit.wikimedia.org/r/745499 (owner: 10JMeybohm) [11:52:03] (03PS1) 10Hnowlan: restbase: add new hosts restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) [11:58:24] (03PS1) 10Muehlenhoff: Enable ganeti2025 as ganeti server [puppet] - 10https://gerrit.wikimedia.org/r/745519 (https://phabricator.wikimedia.org/T282603) [11:59:34] (03CR) 10Hashar: [C: 03+1] "Thanks for the explanation :)" [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [12:00:04] Amir1, Lucas_WMDE, and apergos: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1200). [12:00:04] kostajh and MatmaRex: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] !log Changing export policy applied on ulsfo CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs (T295672) [12:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:12] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [12:00:29] there are two patches in the window today, no trainees signed up. 
[12:00:37] \o [12:00:38] I’m around but in a meeting [12:00:51] hello [12:00:57] Amir1: around? I'd like to not be the sole deployer [12:01:11] apergos: I'm always around [12:01:31] I can deploy if needed [12:01:34] hello, MatmaRex! the patch you want deployed touches three separate files. We need to know (or you need to know, if you are self-deploying) the right order for those to be synced. [12:02:20] hello kostajh! are you a self deployer or do you need someone to do that for you? sorry if I ask the same thing every week, I also forget every week :-D [12:02:30] apergos: hmm, i guess InitialiseSettings.php first, then the other two in any order (not sure which one of them is redundant) [12:02:43] I can deploy my patch, but if you want to do it, please feel free [12:02:55] apergos: (but it shouldn't matter much) [12:02:57] we encourage people to deploy themselves :-) [12:03:03] i don't have access [12:03:31] https://wikitech.wikimedia.org/wiki/SRE/Production_access :-) [12:03:54] apergos: actually, I think I'm going to wait on my patch, because I have a question for the author. [12:04:24] kostajh: do you want to pull it from the window entirely? or just wait until later in the window when you have the answer? [12:04:34] (03CR) 10EllenR: [C: 03+1] wmf-config: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [12:04:48] apergos: hold on a sec, just doubting myself for a moment. I think it's fine to go now, just reviewing it once more [12:04:54] waiting. [12:05:28] apergos: indeed. it's fine. So, I'll deploy it myself [12:05:32] ok! [12:06:00] who is deploying MatmaRex's patch? should I start with mine? 
(it will take 25 minutes for CI to do its thing) [12:06:15] kostajh: go ahead please [12:06:23] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745371 (https://phabricator.wikimedia.org/T297248) (owner: 10Kosta Harlan) [12:06:28] MatmaRex: do you want me to review your patch? [12:06:49] sure [12:06:54] or, you know, anyone :D [12:07:26] um [12:07:32] you've not got a +1 from anyone yet? [12:07:42] that's really not our role here, hrm [12:07:42] (03CR) 10Bartosz Dziewoński: [C: 03+1] Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) (owner: 10Esanders) [12:08:07] (03CR) 10Ladsgroup: [C: 03+2] Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) (owner: 10Esanders) [12:08:16] apergos: ed wrote the patch, here's my +1 if you want it to be officially approved by two people [12:08:26] :-D [12:08:32] I personally prefer zhwiki as that's the db name but I'm biased [12:08:38] (instead of zh.wiki) [12:09:05] Amir1: feel free to edit the commit message, we don't mind [12:09:16] nah, it's style [12:09:55] (03Merged) 10jenkins-bot: Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) (owner: 10Esanders) [12:10:47] MatmaRex: live in mwdebug1002, please test 🥺 [12:11:01] looking [12:11:15] (03CR) 10Ladsgroup: [C: 03+2] Major fixes to maintenance/pruneRevData.php [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/745239 (https://phabricator.wikimedia.org/T290769) (owner: 10Ladsgroup) [12:11:50] (03PS1) 10Ladsgroup: Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745378 
(https://phabricator.wikimedia.org/T296380) [12:11:55] (03CR) 10Ladsgroup: [C: 03+2] Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745378 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [12:12:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:22] please do look into getting deployer access, there's no reason not to, and it's good to have the ability. plus the more people self deploy, the less work for the people running the window :-P [12:12:22] while we are at it, I'll backport some straightforward patches, they are noop for prod (maint scripts) [12:12:33] Amir1: looks good [12:12:44] Amir1: that FlaggedRevs patch seems to pass the wrong string into getOption()? [12:12:49] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/745378/1/maintenance/pruneRevData.php uses 'sleep' twice [12:12:50] Amir1: don't forget to add those in the deployment calendar for the window just for the record [12:12:53] !lof installing PHP 7.3 security updates [12:12:58] okay, syncing [12:13:03] i don't really want to have production access btw, i think i'd need to upgrade my own security way too much before i could agree to that [12:13:09] Lucas_WMDE: possible 🤦 [12:13:15] like getting a separate laptop for work stuff and personal stuff [12:13:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:13:24] and never installing an npm package again [12:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:36] I don't see the problem re: npm :-P :-D [12:13:49] syncing the IS.php right now [12:14:00] (of course if you are not comfortable getting access for any reason, that's ok.) 
[12:14:33] well i work on code where it wasn't my decision to require an npm build process on every commit. such is life [12:14:42] Amir1: added two comments to the original version of the change on master [12:14:42] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:744066|Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 06s) [12:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:47] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269 [12:16:03] thanks! [12:16:27] I suppose containerised local testbeds for everything might help with that. someday. [12:16:33] !log ladsgroup@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: Config: [[gerrit:744066|Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 05s) [12:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:11] apergos: isn’t that what https://github.com/wikimedia/fresh is for? (at least as far as npm goes) [12:19:13] I was thinking more generally, that we ought to have such environments for everything we develop [12:19:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:39] Lucas_WMDE: you earned yourself a patch to review: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/745529 [12:19:59] Amir1: but do I *really* want to +2 something in FlaggedRevs 🤔 [12:20:01] MatmaRex: so everything is good? [12:20:15] Lucas_WMDE: help me in this mess :P [12:20:25] Amir1: yeah, i think so? [12:20:33] cool [12:20:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) p:05High→03Low Ok I have applied the changes on cr1-eqiad and cr2-eqiad. All went as expected. Paste w... [12:21:39] Amir1: do you need to sync 'zhwiki.yaml' as well? (the config is already behaving correctly when not using mwdebug servers though) [12:21:58] zhwiki.yaml is not used in production directly [12:22:21] right [12:22:23] I sync it for the sake of consistency [12:23:08] (03PS3) 10Ladsgroup: deployment-prep: Remove motd class from hiera [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) [12:23:13] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] deployment-prep: Remove motd class from hiera [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [12:23:31] !log ladsgroup@deploy1002 Synchronized wmf-config/config/zhwiki.yaml: Config: [[gerrit:744066|Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 05s) [12:23:34] +1 for syncing that indeed [12:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:35] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269 [12:26:59] RECOVERY - snapshot of x1 in eqiad on alert1001 is OK: Last snapshot for x1 at eqiad (db1116.eqiad.wmnet:3320) taken on 2021-12-09 12:05:19 (276 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [12:27:09] (03PS1) 10Ladsgroup: media: Invalidate all file-djvu WAN caches [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745379 (https://phabricator.wikimedia.org/T296001) [12:27:22] (03CR) 10Ladsgroup: [C: 03+2] "To catch the train" [core] (wmf/1.38.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/745379 (https://phabricator.wikimedia.org/T296001) (owner: 10Ladsgroup) [12:27:32] I have this one, I forgot :P [12:32:21] (03PS1) 10Cathal Mooney: Removing routing policy iBGP_nh_self as it is no longer used. [homer/public] - 10https://gerrit.wikimedia.org/r/745536 (https://phabricator.wikimedia.org/T295672) [12:36:23] (03Merged) 10jenkins-bot: CacheDecorator: Bump cache version [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745371 (https://phabricator.wikimedia.org/T297248) (owner: 10Kosta Harlan) [12:36:25] (03Merged) 10jenkins-bot: Major fixes to maintenance/pruneRevData.php [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/745239 (https://phabricator.wikimedia.org/T290769) (owner: 10Ladsgroup) [12:36:28] (03Merged) 10jenkins-bot: Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745378 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [12:37:09] ok, I'll start with deploying my patch now [12:38:17] +1 :] [12:39:43] kostajh: for the patches afterwards, let me know once you're done. Thanks! [12:40:27] I see four extensions with "Changes not stage for commit". Echo, EntitySchema, SecurePoll, VisualEditor. Do I need to worry about those? [12:40:53] I'm at the step where I will run git rebase and git submodule update (https://deploy-commands.toolforge.org/bacc/745371) [12:41:08] let me check [12:41:59] kostajh: after what command exactly? [12:42:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:25] Amir1: "git status" [12:42:59] kostajh: yup, that's fine [12:43:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:47] (03CR) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [12:46:55] (03CR) 10Jbond: [C: 03+1] "LGTM but agree good to get the final ok from dcops" [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [12:47:22] !log kharlan@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/CacheDecorator.php: Backport: [[gerrit:745371|CacheDecorator: Bump cache version (T297248)]] (duration: 01m 05s) [12:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:27] T297248: TypeError: Argument 1 passed to GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::hasMinimumWidth() must be of the type integer, null given, called in /srv/mediawiki/php-1.38.0-wmf.12/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php on line 242 - https://phabricator.wikimedia.org/T297248 [12:47:31] Amir1: all done, over to you [12:47:32] 10SRE-Access-Requests: Add damilare to icinga - https://phabricator.wikimedia.org/T297383 (10Damilare) [12:47:44] \o/ [12:48:13] 10SRE, 10SRE-Access-Requests: Add damilare to icinga - https://phabricator.wikimedia.org/T297383 (10Damilare) a:05Jeff_G→03Jgreen [12:49:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:18] (03Merged) 10jenkins-bot: media: Invalidate all file-djvu WAN caches [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745379 (https://phabricator.wikimedia.org/T296001) (owner: 10Ladsgroup) [12:49:34] cool [12:50:10] (03CR) 10Jbond: Fix auth_key for k8s_staging PKI profile (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/745499 (owner: 10JMeybohm) [12:50:10] can you say a word about the uncomitted changes, Amir1, and why those aren't a problem? then anyone following along (I am optimistic that there are such people) will learn something from it :-) [12:50:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:25] apergos: well, I can't :) [12:50:30] :-D [12:52:35] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove netconsole server from esams [puppet] - 10https://gerrit.wikimedia.org/r/745485 (owner: 10Filippo Giunchedi) [12:53:11] jbond: I'll merge your change in labs/private [12:55:05] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745239|Major fixes to maintenance/pruneRevData.php (T290769)]] (duration: 01m 05s) [12:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:10] T290769: Create a regular job for running pruneRevData.php - https://phabricator.wikimedia.org/T290769 [12:55:58] (03CR) 10Filippo Giunchedi: [C: 03+2] role: remove netconsole from ganeti [puppet] - 10https://gerrit.wikimedia.org/r/745486 (owner: 10Filippo Giunchedi) [12:56:05] (03PS2) 10Filippo Giunchedi: role: remove netconsole from ganeti [puppet] - 10https://gerrit.wikimedia.org/r/745486 [12:56:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for 
release 'pinkunicorn' . [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:10] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745378|Change logic of pruneChange to allow deleting rows more flexibly (T296380)]] (duration: 01m 05s) [12:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380 [12:57:44] (03PS2) 10Filippo Giunchedi: prometheus: remove broken blackbox check for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/745213 (https://phabricator.wikimedia.org/T291946) [12:58:59] (03PS1) 10Jbond: C:cfssl::config: improve error message [puppet] - 10https://gerrit.wikimedia.org/r/745540 [12:59:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove broken blackbox check for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/745213 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:59:15] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/media/DjVuHandler.php: Backport: [[gerrit:745379|media: Invalidate all file-djvu WAN caches (T296001)]] (duration: 01m 05s) [12:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:20] T296001: DjVuHandler: getDimensionInfoFromMetaTree: PHP Notice: Undefined index: pages - https://phabricator.wikimedia.org/T296001 [12:59:46] (03PS4) 10Jbond: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [13:00:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump logging level for blackbox-exporter [puppet] - 
10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:00:11] (03PS5) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) [13:00:36] (03CR) 10JMeybohm: [C: 03+2] calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [13:02:42] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:05:24] (03Merged) 10jenkins-bot: calico: Allow to configure the IPAM module [deployment-charts] - 10https://gerrit.wikimedia.org/r/740858 (https://phabricator.wikimedia.org/T296303) (owner: 10JMeybohm) [13:06:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32931/console" [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [13:09:14] (03CR) 10Jbond: [C: 03+1] diamond: delete collector::servicestats* [puppet] - 10https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [13:09:38] (03CR) 10Jbond: [C: 03+1] contint: delete deployment_dir class [puppet] - 10https://gerrit.wikimedia.org/r/744839 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [13:10:29] (03CR) 10Jbond: [C: 03+1] contint: delete the proxy_gerrit class [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [13:11:59] (03PS1) 10JMeybohm: Remove now unused key profile::pki::multirootca::intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/745541 [13:12:17] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove now unused key 
profile::pki::multirootca::intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/745541 (owner: 10JMeybohm) [13:13:59] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32932/console" [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [13:16:14] (03PS1) 10Ssingh: durum: update site.css (fix display issues on small screens) [puppet] - 10https://gerrit.wikimedia.org/r/745542 [13:18:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32933/console" [puppet] - 10https://gerrit.wikimedia.org/r/745542 (owner: 10Ssingh) [13:19:56] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: update site.css (fix display issues on small screens) [puppet] - 10https://gerrit.wikimedia.org/r/745542 (owner: 10Ssingh) [13:19:58] (03CR) 10Ayounsi: [C: 03+1] Removing routing policy iBGP_nh_self as it is no longer used. [homer/public] - 10https://gerrit.wikimedia.org/r/745536 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:23:52] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:58] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:53] RECOVERY - dump of m2 in codfw on alert1001 is OK: Last dump for m2 at codfw (db2078.codfw.wmnet:3322) taken on 2021-12-09 07:36:13 (505 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:28:39] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:44] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[13:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:58] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:13] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:18] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:27] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:54] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:24] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:43] anyone here could help me with a scap error happening on the beta cluster? 
`13:33:09 deploy failed: Did not find a file named maps-canaries in search path: ['/srv/deployment/kartotherian/deploy/scap', '/etc/dsh/group']` [13:38:06] I'm running a script on dewiki, don't be alarmed [13:38:32] mbsantos: I can't help you but I suggest asking in #wikimedia-releng [13:38:49] thansk Amir1 [13:39:30] (03CR) 10Jbond: [C: 03+2] C:cfssl::config: improve error message [puppet] - 10https://gerrit.wikimedia.org/r/745540 (owner: 10Jbond) [13:39:57] !log installing tar security updates on stretch [13:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:21] (03PS5) 10Jbond: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [13:45:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [13:50:09] (03CR) 10Cathal Mooney: [C: 03+2] Removing routing policy iBGP_nh_self as it is no longer used. [homer/public] - 10https://gerrit.wikimedia.org/r/745536 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:50:31] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Removing routing policy iBGP_nh_self as it is no longer used. [homer/public] - 10https://gerrit.wikimedia.org/r/745536 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:51:46] could somebody update the topic? 
sukhe was on clinic duty till yesterday and I'm taking over for the rest of the week [13:51:49] thanks [13:55:01] <_joe_> vgutierrez: sure [13:55:04] !log installing postgres security updates on puppetdb2002 [13:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:35] thx [13:58:16] !log installing postgres security updates on netboxdb hosts [13:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:00] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) 05Open→03In progress [14:07:42] (03CR) 10Ayounsi: "Thanks! Replies inline." [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [14:07:49] !log installing cups security updates on stretch hosts [14:07:49] (03PS1) 10Bartosz Dziewoński: [WIP] VE on zh.wiki: Enable SET mode but defaulting to multi-tab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 [14:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:02] (03PS6) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [14:10:03] (03CR) 10Ayounsi: "I also updated the kafka topic to be more explicit with "network_flows_internal"" [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [14:11:10] <_joe_> !isspull [14:12:17] <_joe_> !issync [14:12:17] Syncing #wikimedia-operations (requested by joe_oblivian) [14:12:18] No updates for #wikimedia-operations [14:12:26] <_joe_> !ispull [14:12:47] <_joe_> jbond: did you merge your change? [14:13:21] <_joe_> ah right I missed the last line [14:13:21] _joe_: should be meregd, was merged via gate and submit (at 14:07) [14:13:31] (03CR) 10Ottomata: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/745484 (owner: 10Elukey) [14:13:58] <_joe_> !issync [14:13:58] Syncing #wikimedia-operations (requested by joe_oblivian) [14:13:59] No updates for #wikimedia-operations [14:14:05] <_joe_> uhm [14:14:06] try now [14:14:22] !log installing python-babel security updates [14:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:35] <_joe_> yeah [14:14:39] <_joe_> !issync [14:14:40] Syncing #wikimedia-operations (requested by joe_oblivian) [14:14:42] Set /cs flags #wikimedia-operations vgutierrez +Aiotv [14:14:44] Set /cs flags #wikimedia-operations jbond +Aiotv [14:14:46] Set /cs flags #wikimedia-operations akosiaris +Aiotv [14:15:03] thats worked thanks [14:15:21] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] lovely :) [14:17:01] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] RECOVERY - dump of x1 in eqiad on alert1001 is OK: Last dump for x1 at eqiad (db1116.eqiad.wmnet:3320) taken on 2021-12-09 12:47:46 (46 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [14:17:54] thanks jayme :) [14:19:09] elukey: yw, does not look like anything is on fire :) [14:20:42] !log installing postgres security updates on codfw maps master (and replicas) [14:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:45] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:57] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:02] 10SRE, 10SRE-Access-Requests: Add damilare to icinga - https://phabricator.wikimedia.org/T297383 (10Vgutierrez) p:05Triage→03Medium per T235676 as @MoritzMuehlenhoff mentioned there, @Damilare should be added to cn=wmf group [14:25:44] (03PS3) 10Majavah: rabbitmq: Add support for listening on TLS [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) [14:25:58] 10SRE, 10SRE-Access-Requests: Add damilare to icinga - https://phabricator.wikimedia.org/T297383 (10Jgreen) a:05Jgreen→03None Removing myself as owner, this is normally handled by SRE. [14:26:56] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:51] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [14:29:03] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:32] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32935/console" [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) (owner: 10Hnowlan) [14:30:41] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:51] (03PS2) 10Hnowlan: restbase: add new hosts restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) [14:35:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:01] this is me I [14:36:03] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:37] !log updated calico chart to calico-0.1.15 on all kubernetes clusters (introducing IPAMConfig) - T296303 [14:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:42] T296303: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 [14:39:56] !log installing postgres security updates on eqiad maps master (and replicas) [14:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] (03PS1) 10Vgutierrez: admin: Add damilare to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/745551 (https://phabricator.wikimedia.org/T297383) [14:43:57] (03PS6) 10JMeybohm: Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 [14:45:41] (03PS6) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [14:46:00] (03CR) 10Jbond: [C: 04-1] "see inline for the -1" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [14:48:50] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "reviewed PCC, looks good https://puppet-compiler.wmflabs.org/compiler1001/32936/" [puppet] - 
10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:50:08] (03CR) 10Jbond: [C: 03+1] Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [14:50:11] (03PS4) 10Majavah: rabbitmq: Add support for listening on TLS [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) [14:50:21] (03CR) 10JMeybohm: "Just changed comments, puppet private data committed as c5b4a3d611591a45f9ac0b1b0864d927a29dbb5c" [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [14:50:50] (03PS5) 10Majavah: rabbitmq: Add support for listening on TLS [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) [14:51:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [14:51:18] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [14:51:39] PROBLEM - DPKG on maps2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:51:45] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:53:16] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [14:55:19] (03PS2) 10Elukey: admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) [14:55:21] (03PS1) 10Elukey: knative-serving: add support for istio egress gateway 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 [14:55:48] (03CR) 10Jsn.sherman: [C: 03+1] "heads-up: Ie2e7afcbb94db95fad3cc45b21f88567920ecff8 has been merged and is available in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [14:55:55] (03CR) 10JMeybohm: [C: 03+2] Add a dedicated profile for k8s_staging [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [14:57:30] topranks XioNoX sth I forgot to mention earlier, due to the way the pingoffload alert is wrote (i.e. if absolute errors > threshold) it will resolve on reboot or when the counter is otherwise reset, JFYI [14:57:54] ah ok thanks. [14:58:07] I was looking into something else but was about to go checking what was wrong on the host. [14:58:59] yeah the host is fine now I think, or at least looks like it from the linked dashboard [14:59:01] PROBLEM - DPKG on maps2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:59:26] "is written" even, I can verb [14:59:31] (03PS1) 10Arturo Borrero Gonzalez: ceph: mgr: ensure all the directory tree is managed by puppet [puppet] - 10https://gerrit.wikimedia.org/r/745556 [14:59:55] (03CR) 10Jbond: Review access change (031 comment) [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745377 (owner: 10David Caro) [15:00:15] "is wrote" is exactly how to see it here in inner city Dublin ;) [15:00:22] "is writ" even [15:00:33] s/see/say/ [15:01:30] lolz [15:01:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/32937/" [puppet] - 10https://gerrit.wikimedia.org/r/745556 (owner: 10Arturo Borrero Gonzalez) [15:01:44] I do miss the odd "JAYSUS" shouted in the street [15:01:51] haha [15:04:16] (03CR) 10Majavah: [C: 04-1] "-1 while I test somewhere how this will affect existing clients before they are 
reconfigured." [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [15:06:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:07:53] PROBLEM - DPKG on maps2005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:08:48] godog: perhaps you could help me, I am trying to dig out where that alert is defined to see if I can work out how to improve it. [15:08:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:09:08] I've been reading some of the docs but not managed to find it yet, if you can point me in the right direction it might help [15:12:45] topranks: sure, check out operations/alerts.git in team-netops/ [15:13:06] ah ok yep will do [15:13:44] there is an alert configured on the panel in Grafana too, I need to re-read what you said earlier about that... that's not the source for this I take it? [15:16:26] topranks: correct, alerts.git is where the alert is defined, though the underlying metric is the same [15:16:37] the grafana panel uses rate(...) and the alert does not [15:16:43] yeah I can see that. [15:16:43] (03PS1) 10Jgiannelos: kartographer: Enable tegola on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745558 [15:17:27] Ok shouldn't be hard in theory to change this I guess, I'll see what I can do. [15:17:50] topranks: ok! let me know [15:18:32] are there any docs on how to test the expressions? I guess I can make an API query to Prometheus? 
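(Editor's sketch, not part of the channel log.) The difference discussed above, between alerting on an absolute error counter and alerting on its rate, can be illustrated with a small simulation. This is a simplified model under stated assumptions: the function names, sample values, window, and thresholds are all hypothetical, and they are not taken from the real PingOffloadMissingIP rule in alerts.git. The rate calculation is a rough stand-in for what a Prometheus-style rate does, not a reimplementation of it.

```python
# Hypothetical illustration: absolute-threshold vs. rate-based alerting
# on a monotonically increasing error counter.

def absolute_alert(samples, threshold):
    """Fire whenever the raw counter value exceeds the threshold."""
    return [value > threshold for value in samples]


def rate_alert(samples, window, threshold):
    """Fire when the average per-step increase over the trailing
    `window` steps exceeds the threshold.

    Simplification: a counter reset is only detected by comparing the
    two endpoints of the window, in which case the current value is
    taken as the increase since the reset.
    """
    states = []
    for i, value in enumerate(samples):
        start = max(0, i - window)
        steps = i - start
        if steps == 0:
            states.append(False)  # not enough data yet
            continue
        increase = value - samples[start]
        if increase < 0:  # counter went backwards: treat as a reset
            increase = value
        states.append(increase / steps > threshold)
    return states


# Errors spike at steps 2-3, then stop; the host reboots at step 8,
# which resets the counter to zero.
counter = [0, 0, 50, 100, 100, 100, 100, 100, 0, 0]

absolute_states = absolute_alert(counter, threshold=10)
rate_states = rate_alert(counter, window=2, threshold=5)

print("absolute:", absolute_states)
print("rate:    ", rate_states)
```

In this model the absolute check keeps firing long after the errors have stopped and only clears when the counter resets, while the rate check clears once the spike has left the trailing window.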
[15:18:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745551 (https://phabricator.wikimedia.org/T297383) (owner: 10Vgutierrez) [15:18:59] yes you can use thanos.w.o with some caveats [15:19:25] there's also a guide at https://wikitech.wikimedia.org/wiki/Alertmanager [15:20:01] yep reading through that now... overdue for me to dig into it deeper thanks :) [15:20:36] sure np! keep in mind that with rate(...) > threshold the alert will auto-resolve once the rate comes back down [15:20:48] e.g. in a spike of errors [15:22:53] RECOVERY - DPKG on maps2010 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:23:02] (03PS1) 10Hnowlan: varnish: add second wikimedia enterprise elastic IP [puppet] - 10https://gerrit.wikimedia.org/r/745560 (https://phabricator.wikimedia.org/T294798) [15:27:54] (03CR) 10Vgutierrez: [C: 03+2] admin: Add damilare to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/745551 (https://phabricator.wikimedia.org/T297383) (owner: 10Vgutierrez) [15:30:09] RECOVERY - DPKG on maps2006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:42:32] !log run `racadm racreset [15:42:35] uff [15:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] !log run `racadm racreset` on kafka-main2003 - mgmt console not reachable via ssh (but pingable) [15:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:29] !log run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` [15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:37] !log run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` from cumin2001 [15:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:45] I am going to clean up the sal :) [15:50:51] PROBLEM - Host db1102.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [15:50:59] (03CR) 10Jbond: [C: 03+1] rabbitmq: Add support for listening on TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [15:52:01] (03CR) 10JMeybohm: [C: 03+1] "Just a nit about comments, looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [15:52:41] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:31] (03PS1) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [15:59:36] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [16:00:23] (03PS2) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:03:29] RECOVERY - Host db1102.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.41 ms [16:03:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10Cmjohnson) 05Open→03Resolved @jcrespo I found a DIMM replacement in a decom'... 
[16:06:25] (03PS1) 10Arturo Borrero Gonzalez: ceph::auth::keyring: ensure parent directory tree exists [puppet] - 10https://gerrit.wikimedia.org/r/745564 [16:07:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add damilare to icinga - https://phabricator.wikimedia.org/T297383 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [16:09:40] (03PS3) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:11:44] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [16:14:21] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32938/" [puppet] - 10https://gerrit.wikimedia.org/r/745564 (owner: 10Arturo Borrero Gonzalez) [16:14:30] (03CR) 10MSantos: [C: 03+1] kartographer: Enable tegola on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745558 (owner: 10Jgiannelos) [16:14:47] (03PS1) 10Jcrespo: Revert "mariadb: Reduce memory allocation for dbs at db1102 due to hw failure" [puppet] - 10https://gerrit.wikimedia.org/r/745380 [16:16:40] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Reduce memory allocation for dbs at db1102 due to hw failure" [puppet] - 10https://gerrit.wikimedia.org/r/745380 (owner: 10Jcrespo) [16:16:44] (03PS4) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:17:36] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:29] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal 
Mooney) [16:20:12] (03PS5) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:20:58] (03PS1) 10Arturo Borrero Gonzalez: ceph::mgr: fix missing parent directory tree [puppet] - 10https://gerrit.wikimedia.org/r/745565 [16:22:13] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [16:22:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:32] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:35] (03CR) 10Arturo Borrero Gonzalez: "This patch is an alternative to https://gerrit.wikimedia.org/r/c/operations/puppet/+/745564" [puppet] - 10https://gerrit.wikimedia.org/r/745565 (owner: 10Arturo Borrero Gonzalez) [16:23:54] (03CR) 10Arturo Borrero Gonzalez: "Alternative to this change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/745565" [puppet] - 10https://gerrit.wikimedia.org/r/745564 (owner: 10Arturo Borrero Gonzalez) [16:27:31] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) [16:28:56] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, and 5 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10jbond) 05Open→03Resolved this is complete [16:29:31] godog: not sure if you've a moment. 
The unit test setup on that alert is defeating me :( [16:29:47] tried a variety of random things, and reading online but I'm still muddy on what is going on [16:29:58] https://gerrit.wikimedia.org/r/c/operations/alerts/+/745563 [16:30:05] Emperor and legoktm: My dear minions, it's time we take the moon! Just kidding. Time for Swift config update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1630). [16:31:16] topranks: sure, I'm about to enter a meeting, will take a look later [16:31:31] yeah no worries it's not in any way urgent. thanks. [16:31:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745565 (owner: 10Arturo Borrero Gonzalez) [16:34:16] (03CR) 10Jbond: "this looks like an alternate to 745565 but i think you could use both" [puppet] - 10https://gerrit.wikimedia.org/r/745564 (owner: 10Arturo Borrero Gonzalez) [16:40:17] (03PS6) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:42:15] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [16:44:38] (03PS7) 10Cathal Mooney: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) [16:46:54] (03CR) 10jerkins-bot: [V: 04-1] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [16:51:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 723 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:53:40] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) I create Order # 1-214270167279 to be on site next week on the 14th at 3:00 PM to meet with the Equinix smart hands tech to perform the troubleshooting while i am on site. [16:54:15] !log mvernon@deploy1002 Synchronized private/PrivateSettings.php: Update swift config T296767 (duration: 01m 05s) [16:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:30] !log remove restbase certificates and configuration entries for decommissioned hosts [16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:46] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 6 probes of 723 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:00:04] Emperor and legoktm: (Dis)respected human, time to deploy Swift config update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1630). Please do the needful. [17:00:04] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
[17:00:06] !log stop kafka* on kafka-main2003 as pre-step before reimaging [17:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:08] going to wait 5/10 mins before reimaging, to see if any issue arise from purged or eventgate [17:03:54] <_joe_> ack [17:08:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph::mgr: fix missing parent directory tree [puppet] - 10https://gerrit.wikimedia.org/r/745565 (owner: 10Arturo Borrero Gonzalez) [17:08:49] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/745565 instead" [puppet] - 10https://gerrit.wikimedia.org/r/745564 (owner: 10Arturo Borrero Gonzalez) [17:12:43] ok all good, proceeding [17:15:10] (03PS2) 10Arturo Borrero Gonzalez: ceph::auth::keyring: ensure parent directory tree exists [puppet] - 10https://gerrit.wikimedia.org/r/745564 [17:15:44] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2003.codfw.wmnet with OS buster [17:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph::auth::keyring: ensure parent directory tree exists [puppet] - 10https://gerrit.wikimedia.org/r/745564 (owner: 10Arturo Borrero Gonzalez) [17:19:17] 10SRE-swift-storage: Swift proxies need a /usr/local/sbin/restart-swift-proxies wrapper - https://phabricator.wikimedia.org/T297413 (10MatthewVernon) [17:21:01] Emperor legoktm nice work re: swift mw credentials rotation! [17:21:34] Emperor did all the work, I just watched :) [17:22:29] heheh [17:23:32] topranks: ok! I'll take a look tomorrow since it isn't urgent, I've added myself to the review so I don't forget [17:23:57] cool... enjoy your evening :) [17:24:59] cheers! 
you too [17:30:01] 10SRE, 10SRE-swift-storage, 10Security-Team, 10SecTeam-Processed, 10Security: Rotate swift auth key for mw:media account - https://phabricator.wikimedia.org/T296767 (10sbassett) [17:30:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1700). [17:30:05] No Gerrit patches in the queue for this window AFAICS. [17:30:05] cwhite: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Deploy OpenSearch. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1730). [17:30:22] jouncebot: what, again? [17:30:40] 10SRE, 10SRE-swift-storage, 10Security-Team, 10SecTeam-Processed, 10Security: Rotate swift auth key for mw:media account - https://phabricator.wikimedia.org/T296767 (10sbassett) [17:31:38] oh weird, those are overlapping windows now? I could have sworn the puppet window was only 30 minutes [17:34:10] hello folks, like all good reimage stories kafka-main2003 seems not able to PXE boot correctly in d-i [17:34:59] (03Abandoned) 10Jdlrobson: Set higher specificity [extensions/MobileFrontend] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745241 (https://phabricator.wikimedia.org/T171726) (owner: 10Jdlrobson) [17:36:37] elukey: which story structure are you?_https://www.google.com/search?q=story+structure&sxsrf=AOaemvKRF5mYbaqGQzay8KoF4HHKtHipyA:1639071368905&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiX8tDVoNf0AhVJUt8KHf_nD-4Q_AUoAXoECAEQAw&biw=1462&bih=918&dpr=2 [17:36:40] https://www.google.com/search?q=story+structure&sxsrf=AOaemvKRF5mYbaqGQzay8KoF4HHKtHipyA:1639071368905&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiX8tDVoNf0AhVJUt8KHf_nD-4Q_AUoAXoECAEQAw&biw=1462&bih=918&dpr=2 [17:37:14] ottomata: you are not helping :D :D [17:37:37] it'll help you later when you write your memoir [17:40:08] (03PS1) 10Ahmon Dancy: Allow scap/files/scap-master-sync to include CDB files 
[puppet] - 10https://gerrit.wikimedia.org/r/745572 (https://phabricator.wikimedia.org/T297326) [17:40:17] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2003.codfw.wmnet with OS buster [17:40:17] elukey: I think the reimage series is getting boring at this stage :) [17:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:28] Do you know where it's getting stuck? [17:40:38] Or I can have a look (no expert mind)? [17:41:38] topranks: thanks for the help! After a chat with Riccardo it may be something that needs a firmware upgrade, so I'll boot the node again with the old os and renew its puppet cert :( [17:41:48] (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/745284 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [17:42:06] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10RLazarus) >>! In T297265#7558906, @fgiunchedi wrote: > it looks like either a kernel bug (we've never run graphite with `5.10.0-8-amd64` So it sounds like... [17:42:13] elukey: ok fair enough, I would bow to volan.s superior knowledge here :) [17:42:51] topranks: it gets stuck after pxe while loading d-i .gz files, never seen it before [17:42:57] (then auto-reboots after a bit) [17:44:08] godog, moritzm: n.b. https://phabricator.wikimedia.org/T297265#7560598 on the ongoing graphite kernel situation [17:44:36] not sure if you're still up but just in case :) would be good to know what to do [17:46:13] elukey: I see it's on BIOS firmware 1.7, I think the latest we commonly use is 2.12.2, so I think Riccardo's suggestion is probably the first place to start yeah. 
[17:46:21] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for kafka-main2003.codfw.wmnet: Renew puppet certificate - elukey@cumin1001 [17:46:22] !log elukey@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for kafka-main2003.codfw.wmnet: Renew puppet certificate - elukey@cumin1001 [17:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:38] !log point kibana7 to OpenSearch in codfw T288621 [17:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:43] T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621 [17:49:08] (03PS1) 10Jdlrobson: Remove broken symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) [17:57:31] (03PS8) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [17:57:33] (03PS10) 10JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [17:59:52] (03PS1) 10Jbond: puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 [18:00:04] cwhite: My dear minions, it's time we take the moon! Just kidding. Time for Deploy OpenSearch deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1730). [18:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1800). 
[18:00:28] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 (owner: 10Jbond) [18:01:37] (03PS2) 10Jbond: puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 [18:05:40] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10MoritzMuehlenhoff) >>! In T297265#7560598, @RLazarus wrote: >>>! In T297265#7558906, @fgiunchedi wrote: >> it looks like either a kernel bug (we've never ru... [18:06:16] (03PS3) 10Jbond: puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 [18:06:35] (03PS4) 10Jbond: puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 [18:06:40] (03PS2) 10Bartosz Dziewoński: VE on zh.wiki: Enable SET mode but defaulting to multi-tab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) [18:06:44] (03CR) 10Andrew Bogott: [C: 03+2] openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [18:07:15] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 (owner: 10Jbond) [18:09:05] (03PS4) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424 [18:10:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:50] (03CR) 10Andrew Bogott: [C: 03+2] openstack: enc: properly fail on server error [puppet] - 
10https://gerrit.wikimedia.org/r/742424 (owner: 10Majavah) [18:12:12] (03PS5) 10Jbond: puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 [18:12:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission thorium.eqiad.wmnet - https://phabricator.wikimedia.org/T292075 (10Cmjohnson) [18:13:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission thorium.eqiad.wmnet - https://phabricator.wikimedia.org/T292075 (10Cmjohnson) 05Open→03Resolved [18:13:12] (03CR) 10Jbond: [C: 03+2] puppet_compiler: Add concept of realms [puppet] - 10https://gerrit.wikimedia.org/r/745575 (owner: 10Jbond) [18:13:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: Decommission old MSW in racks A, B, C and D - https://phabricator.wikimedia.org/T296770 (10Cmjohnson) 05Open→03Resolved all old switches removed and netbox updated [18:13:56] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson) [18:14:27] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated. [18:14:48] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson) [18:15:24] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson) 05Open→03Resolved removed from rack and updated netbox [18:15:30] !log kafka-main2003 back in service with the old OS (stretch). 
Re-created a new puppet host key and signed it on the puppet master [18:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson) [18:16:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson) idracs are setup, neet f/w update and OS install [18:16:24] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:17:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:56] (03PS1) 10Jbond: C:puppet_compiler::uploader: fix ip type [puppet] - 10https://gerrit.wikimedia.org/r/745577 [18:19:20] 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10elukey) [18:19:45] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler::uploader: fix ip type [puppet] - 10https://gerrit.wikimedia.org/r/745577 (owner: 10Jbond) [18:26:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:31:46] (03PS3) 10Bartosz Dziewoński: VE on zh.wiki: Enable SET mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) [18:33:55] PROBLEM - graphite.wikimedia.org render on graphite2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:33:59] PROBLEM - 
graphite.wikimedia.org api on graphite2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:37:16] (03PS1) 10Jbond: C:puppet_compiler: update default verion to 2.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/745579 [18:38:04] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: update default verion to 2.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/745579 (owner: 10Jbond) [18:38:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:13] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:34] (03PS4) 10Bartosz Dziewoński: VE on zh.wiki: Enable single-edit-tab mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) [18:47:47] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:50:15] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:01] !log powercycle graphite2003 T297265 [18:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:06] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [18:53:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:16] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [18:54:18] (03PS1) 10Jbond: puppet_compiler: notify uwsgi when file canges [puppet] - 10https://gerrit.wikimedia.org/r/745580 [18:54:21] RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:20] PROBLEM - Host graphite2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:30] RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [18:55:33] PROBLEM - carbon-frontend-relay service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:55:49] PROBLEM - carbon-cache@c service on graphite2003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:56:02] o/ [18:56:17] ^^ I just powercycled graphite2003 because it hung [18:56:18] <_joe_> i guess that was the powercycle by cwhite [18:56:23] argh, here [18:56:24] <_joe_> yep [18:56:33] should be back up now [18:56:46] I'm here too [18:56:55] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:01] le sigh graphite2003 [18:57:11] RECOVERY - graphite.wikimedia.org render on graphite2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1634 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:57:17] RECOVERY - graphite.wikimedia.org api on graphite2003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.073 
second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:57:24] I suppose this means we've eliminated the hardware possiblility [18:57:24] (03PS2) 10Jbond: puppet_compiler: notify uwsgi when file changes [puppet] - 10https://gerrit.wikimedia.org/r/745580 [18:57:28] so okay, whatever the problem was with 5.10.46 on graphite1004, we have the same situation on 2003 [18:57:36] (see updates on T297265 for context) [18:57:36] T297265: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 [18:57:39] indeed [18:57:43] RECOVERY - carbon-frontend-relay service on graphite2003 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:57:48] m.oritzm's suggestion of upgrading the firmware sounds like a good one [18:57:57] RECOVERY - carbon-cache@c service on graphite2003 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:58:10] (03CR) 10Jbond: [C: 03+2] puppet_compiler: notify uwsgi when file changes [puppet] - 10https://gerrit.wikimedia.org/r/745580 (owner: 10Jbond) [18:58:13] both are 2018 hosts. probably they both need it [18:58:29] 👍 [18:58:43] SGTM [18:58:51] RECOVERY - snapshot of s2 in eqiad on alert1001 is OK: Last snapshot for s2 at eqiad (db1139.eqiad.wmnet:3312) taken on 2021-12-09 16:39:23 (1054 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [18:58:52] I do also want to consider rolling back fully, as discussed on-task -- it's a big downgrade but we gotta get out of the situation we're in [18:58:55] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:59:01] can we upgrade the firmware on one host and downgrade the other just in case? 
[18:59:07] *downgrade the kernel on the other [18:59:34] yeah I think downgrading the kernel sounds good, I mean they ran fine before [18:59:36] I'm happy to go with whatever the o11y folks think is most reasonable [18:59:41] since the bullseye upgrade that is [18:59:50] hey, we have a mediwaiki backport window coming up, should this graphite thing block it? [19:00:05] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1900). [19:00:05] cjming, eigyan, mbsantos, mbsantos, and MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:14] o/ [19:00:16] hi [19:00:16] hey [19:00:23] RoanKattouw, urbanecm: at least stand by a moment while we sort this out please [19:00:23] hey [19:00:25] sorry for the delay [19:00:32] rzl: ack, please ping me when ready [19:00:40] pinging Seddon too since they apparently forgot to replace the nick when copy-pasting [19:00:53] yeah I'm for rolling back the kernel on graphite2003, cdanis cwhite ? [19:00:57] Ooops! [19:01:03] and cc/ nemo-yiannis [19:01:36] [prep for actual B&C to come] majavah: will you want to lead the window? (okay if not) [19:01:38] godog: the other question is do we want to do this now or later -- the host is back up for now so we could unblock the deploy, and do the work afterward [19:01:53] rzl: sure that works for me [19:01:59] but the downside is, no promises how long it will stay up :) and it's late in your day ofc [19:02:14] no worries I can be around [19:02:25] urbanecm: please do, I'm going to eat something now if there's a delay [19:02:34] okay [19:02:40] we should roll back the kernel [19:02:55] rzl: +1 to wait for deploy and then go ahead [19:03:05] cwhite: ^ that plan okay with you? 
[19:03:09] both sounds good to me [19:03:21] sgtm, no need to block backports imo [19:03:25] urbanecm: okay, fire away :) [19:03:30] thanks [19:03:40] give us a heads up when done please, the graphite work will leave a hole in mediawiki metrics for a while [19:03:44] will do [19:04:09] eigyan: hey, around? [19:04:30] hello [19:05:13] Seddon: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/745578 is not yet reviewed, so i don't think i can deploy that [19:05:21] (03PS3) 10Urbanecm: Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [19:05:27] (03CR) 10Urbanecm: [C: 03+2] Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [19:05:31] Alright I'll catch the next window! [19:05:56] sounds good, thanks [19:06:07] (03PS23) 10Urbanecm: [beta] Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:06:21] (03Merged) 10jenkins-bot: Deploy sticky header and A/B test enrollment to office, test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745285 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [19:07:28] cjming: your patch is at mwdebug1001, can you have a look? [19:07:34] yup [19:08:00] urbanecm: gtg! [19:08:14] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10MoritzMuehlenhoff) But in general; given that these hosts ran fine before with 5.10.70, we can also easily revert to that. The downgrade towards .46 was do... 
[19:08:24] syncing [19:09:05] (03PS24) 10Urbanecm: [beta] Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:09:10] (03CR) 10Urbanecm: [C: 03+2] [beta] Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:09:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c1e95519682f0bd6633fb9fd7f49e7a664ec9f87: Deploy sticky header and A/B test enrollment to office, test wikis (T295972) (duration: 01m 06s) [19:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:53] T295972: Deploy sticky header to office wiki and test wiki - https://phabricator.wikimedia.org/T295972 [19:09:53] (03Merged) 10jenkins-bot: [beta] Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [19:09:55] cjming: live [19:10:05] eigyan: your beta patch will be automatically deployed within next 30 minutes [19:10:07] if not, ping me :) [19:10:16] urbanecm: woohoo - thanks 🙌 [19:10:21] np [19:10:30] (03PS2) 10Urbanecm: kartographer: Enable tegola on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745558 (owner: 10Jgiannelos) [19:10:33] (03CR) 10Urbanecm: [C: 03+2] kartographer: Enable tegola on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745558 (owner: 10Jgiannelos) [19:10:41] thank you urbanecm [19:10:46] any time [19:11:19] (03Merged) 10jenkins-bot: kartographer: Enable tegola on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745558 (owner: 10Jgiannelos) [19:11:52] mbsantos: nemo-yiannis: can you test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/745558? 
It's available at mwdebug1001 now [19:12:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:12] urbanecm: lgtm [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:45] thanks, syncing [19:13:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:46] (03PS5) 10Urbanecm: VE on zh.wiki: Enable single-edit-tab mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:13:54] (03CR) 10Urbanecm: [C: 03+2] VE on zh.wiki: Enable single-edit-tab mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:14:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 06bd3e627cfef805f8b56be2b38b9125471b1410: kartographer: Enable tegola on frwiki (duration: 01m 05s) [19:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:57] (03Merged) 10jenkins-bot: VE on zh.wiki: Enable single-edit-tab mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745546 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:15:59] MatmaRex: your patch is at mwdebug1001 [19:16:20] looking [19:18:55] urbanecm: hmm, i realized i missed something in this patch, but i'll just submit another one. it's good to go [19:19:09] MatmaRex: do you want me to do the follow-up now too? [19:19:33] or should i just sync and let you wait for a next window? [19:19:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:52] urbanecm: i don't want to hold you up, and i'll need a while to figure it out [19:19:58] okay [19:20:00] in that case, syncing [19:20:06] so please sync, i'll try the next window [19:20:18] ok [19:20:24] note it's thursday, and there's only one left [19:20:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4e6cba0d3590446bb02815b65ba1c4ae9ed7bfac: VE on zh.wiki: Enable single-edit-tab mode (T296269) (duration: 01m 05s) [19:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:29] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269 [19:21:35] MatmaRex: live [19:21:49] thanks [19:21:56] np [19:21:59] godog: over to you [19:22:09] urbanecm: ack [19:23:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [19:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:18] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10ops-monitoring-bot) Host rebooted by filippo@cumin1001 with reason: None [19:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:27:39] that looks like a spike so far [19:29:24] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [19:29:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:02] (03PS1) 10Jbond: pcc_uploader: improve error messaging [puppet] - 10https://gerrit.wikimedia.org/r/745584 [19:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:31:17] (03CR) 10Jbond: [C: 03+2] pcc_uploader: improve error messaging [puppet] - 10https://gerrit.wikimedia.org/r/745584 (owner: 10Jbond) [19:33:44] (03PS1) 10Ladsgroup: Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745383 (https://phabricator.wikimedia.org/T296380) [19:36:58] I'm deploying a security patch for T297416 [19:38:23] (03PS1) 10Andrew Bogott: cloudbackup1002-dev: mark as spare for now [puppet] - 10https://gerrit.wikimedia.org/r/745588 (https://phabricator.wikimedia.org/T295584) [19:39:49] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup1002-dev: mark as spare for now [puppet] - 10https://gerrit.wikimedia.org/r/745588 (https://phabricator.wikimedia.org/T295584) (owner: 10Andrew Bogott) [19:40:34] !log deployed patch for T297416 [19:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:12] did the firmware update already happen on the graphite hosts? I'm in support of doing it, but sequence wise it would be good to avoid changing multiple things at the same time [19:44:56] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10fgiunchedi) I've rolled back graphite2003 to `5.10.0-9-amd64`, next steps as per IRC convo are to wait for graphite2003' stability, and consider upgrading f... 
[19:45:01] herron: no not yet, just a rollback of graphite2003 [19:45:07] kk [19:45:07] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:45:57] going back afk but I'm available in case sth happens [19:49:25] later godog, thanks [19:52:13] herron: agreed re multiple things at the same time -- the proposal was to change one thing on each host [19:52:24] 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Ladsgroup) [19:53:55] (03PS1) 10Majavah: P::openstack: add back puppetmaster_ca [puppet] - 10https://gerrit.wikimedia.org/r/745591 [19:54:15] !log deployed patch for T297416 [19:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:58] jouncebot: nowandnext [19:55:58] For the next 0 hour(s) and 4 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T1900) [19:55:58] In 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T2000) [19:56:18] not a good time then :) [19:57:25] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:58:09] (03CR) 10Andrew Bogott: [C: 03+2] P::openstack: add back puppetmaster_ca [puppet] - 10https://gerrit.wikimedia.org/r/745591 (owner: 10Majavah) [20:00:05] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T2000). [20:00:16] o/ [20:00:20] o/ [20:00:45] legoktm: Clear for the train to proceed? [20:01:25] I believe so [20:01:40] Alright. Pressing the buttons. 
[20:02:03] (03PS1) 10Ahmon Dancy: group2 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745594 [20:02:07] rzl: thanks, saw some chat above about both hosts needing updates and wanted to double check. sgtm [20:02:07] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745594 (owner: 10Ahmon Dancy) [20:03:06] (03Merged) 10jenkins-bot: group2 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745594 (owner: 10Ahmon Dancy) [20:04:19] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.12 refs T293953 [20:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:24] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [20:07:23] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:07:28] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10ABorbaWMF) Appears to be fixed on 6.8.2 (1868) [20:07:33] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:14] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10RLazarus) LGTM, ship it -- I thought he was already in there! [20:11:28] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Majavah) Support! [20:11:52] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Urbanecm) This looks like a good idea to me. +1. [20:20:20] (03PS1) 10Clare Ming: Update WebABTestEnrollment name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) [20:21:16] (03PS1) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [20:21:52] (03CR) 10jerkins-bot: [V: 04-1] WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 (owner: 10Jbond) [20:24:21] (03CR) 10Nray: [C: 03+1] Update WebABTestEnrollment name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [20:24:33] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:26:01] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:26:09] RECOVERY - dump of s2 in eqiad on alert1001 is OK: Last dump for s2 at eqiad (db1139.eqiad.wmnet:3312) taken on 2021-12-09 18:57:54 (125 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [20:26:13] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:50] 10ops-eqiad: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10colewhite) [20:37:19] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10colewhite) [20:37:23] 10ops-eqiad: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10colewhite) [20:45:59] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:48:24] 10SRE, 10ops-eqiad: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10Cmjohnson) @colewhite we can update the f/w but this will require the server to be out of production for about 30 minutes. We can do this almost anytime, let me know when you would like to... [20:51:38] (03CR) 10Jdlrobson: [C: 03+1] "I assume the plan is to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/745586 at the same time to 1.38.0-wmf.12 ? if s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [20:53:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson kubernetes1022 B6 U35 cableid#3964 Port34 [20:53:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Jclark-ctr) [20:53:23] (03PS1) 10JHathaway: mirror - wip [puppet] - 10https://gerrit.wikimedia.org/r/745606 [20:55:15] 10SRE, 10ops-eqiad: Upgrade firmware on graphite1004 if upgrade available. 
- https://phabricator.wikimedia.org/T297433 (10colewhite) I think it just needs to be downtimed in icinga for the maintenance window. Being the backup host, I think you can proceed when you're ready. [20:58:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:29] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:58:31] (03PS1) 10Clare Ming: Update A/B test enrollment name [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745607 (https://phabricator.wikimedia.org/T292587) [21:10:32] (03CR) 10Nray: [C: 03+1] Update A/B test enrollment name [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745607 (https://phabricator.wikimedia.org/T292587) (owner: 10Clare Ming) [21:13:29] (03CR) 10Clare Ming: Update WebABTestEnrollment name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) (owner: 10Clare Ming) [21:15:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:33] (03CR) 10Jdlrobson: [C: 03+1] Update A/B test enrollment name [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745607 (https://phabricator.wikimedia.org/T292587) (owner: 10Clare Ming) [21:39:45] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) (owner: 10Hnowlan) [21:40:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson ganeti1025 a2 u41 cableid#1208202101 port36 ganeti1026 a7 u13 cableid#1208202102 port35 ganeti1027 c... 
[21:40:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Jclark-ctr) [21:42:55] (03CR) 10Eevans: [C: 03+1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [21:52:28] (03CR) 10Dzahn: [C: 03+2] Allow scap/files/scap-master-sync to include CDB files [puppet] - 10https://gerrit.wikimedia.org/r/745572 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [21:53:33] 10SRE, 10DBA, 10Data-Persistence, 10observability, 10User-Ladsgroup: Send metrics of db errors of mediawiki to promethues - https://phabricator.wikimedia.org/T297435 (10Ladsgroup) [21:54:06] 10SRE, 10DBA, 10Data-Persistence, 10observability, 10User-Ladsgroup: Send metrics of db errors of mediawiki to promethues - https://phabricator.wikimedia.org/T297435 (10Ladsgroup) p:05Triage→03Medium [21:54:48] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10Ladsgroup) I think this can be simply sent by mediawiki which would be much easier. If that's acceptable for you, then... 
[21:56:59] jouncebot: nowandnext [21:56:59] For the next 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211209T2000) [21:56:59] In 2 hour(s) and 3 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211210T0000) [21:57:14] (03CR) 10Ladsgroup: [C: 03+2] Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745383 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [21:58:39] 10SRE, 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Legoktm) 05Open→03Resolved a:03Legoktm Done. [21:59:59] (03CR) 10Dzahn: "deployed on deploy1001. the default should not have changed (NOT syncing CDB files by default)" [puppet] - 10https://gerrit.wikimedia.org/r/745572 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [22:00:30] (03CR) 10Ahmon Dancy: Allow scap/files/scap-master-sync to include CDB files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745572 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [22:01:11] (03CR) 10Dzahn: contint: delete the proxy_gerrit class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [22:01:43] (03Merged) 10jenkins-bot: Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745383 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [22:05:55] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745383|Fix the mistake in passing parameter (T296380)]] (duration: 02m 11s) [22:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:00] T296380: flaggedtemplates table is still too big - 
https://phabricator.wikimedia.org/T296380 [22:06:12] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/32943/" [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [22:06:18] (03PS2) 10Dzahn: contint: delete the proxy_gerrit class [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) [22:06:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson cloudbackup1003 c8 u37 cableid#1208202105 port#41 cloudbackup1004 d5 u35 cableid#1208202106 p... [22:09:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Jclark-ctr) [22:14:01] (03CR) 10Dzahn: "noop on contint1001 and gerrit1001" [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [22:14:10] !log dancy@deploy1002 Synchronized README: testing https://gerrit.wikimedia.org/r/745572 (duration: 00m 55s) [22:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:51] (03CR) 10Dzahn: [C: 03+2] "ok, thanks again for the detailed response, Antoine. I'll ship it." 
[puppet] - 10https://gerrit.wikimedia.org/r/744839 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [22:15:59] (03PS1) 10JHathaway: debian mirror: add new mirror, copernicium [puppet] - 10https://gerrit.wikimedia.org/r/745612 [22:20:08] (03CR) 10Dzahn: [C: 03+1] "ah, cool. is this codfw or an upgrade of eqiad?" [puppet] - 10https://gerrit.wikimedia.org/r/745612 (owner: 10JHathaway) [22:20:42] (03CR) 10Dzahn: [C: 03+1] "nitpick: a ticket to link to in the footer would be nice" [puppet] - 10https://gerrit.wikimedia.org/r/745612 (owner: 10JHathaway) [22:30:21] (03CR) 10Cwhite: [C: 03+1] prometheus: remove job unavailable alert [puppet] - 10https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [22:31:39] (03CR) 10Cwhite: [C: 03+1] team-sre: port job unavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [22:33:26] (03CR) 10Cwhite: [C: 03+1] alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [22:33:35] (03PS2) 10JHathaway: debian mirrors: add new mirror, copernicium in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) [22:34:23] (03CR) 10Cwhite: [C: 03+1] prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [22:39:49] RECOVERY - snapshot of s3 in eqiad on alert1001 is OK: Last snapshot for s3 at eqiad (db1145.eqiad.wmnet:3313) taken on 2021-12-09 21:01:31 (1171 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [22:43:45] (03CR) 10Dzahn: "per IRC chat, let's see first if we go "mirror1001"" [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [22:53:16] 
(PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [23:25:02] 10SRE, 10DBA, 10observability, 10User-Ladsgroup: Send metrics of db errors of mediawiki to promethues - https://phabricator.wikimedia.org/T297435 (10Ladsgroup) [23:30:00] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10nshahquinn-wmf) Data Engineering folks, this ticket needs some input from you 😊 >>! In T252227#6156179, @BBlack wrote: > Before we go all the way down that path, we should... [23:31:59] (03PS1) 10Esanders: Revert "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745385 [23:33:13] (03PS2) 10Esanders: Revert "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745385 (https://phabricator.wikimedia.org/T296269) [23:52:33] (03CR) 10Dzahn: "> This method of applying classes makes sense for production but it" [puppet] - 10https://gerrit.wikimedia.org/r/745286 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [23:53:48] (03CR) 10Dzahn: "thanks Ladsgroup (And DBrant)!:) https://phabricator.wikimedia.org/T133183#7554509" [puppet] - 10https://gerrit.wikimedia.org/r/745287 (https://phabricator.wikimedia.org/T133183) (owner: 10Ladsgroup) [23:55:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [23:57:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#7556709, @Ladsgroup wrote: >>> Maybe we should find a better place to put the class in? Agreed! I find this way to include classes unu... 
[23:58:50] (03CR) 10Dzahn: "@Jbond from my side it's good "as long as we think it's easy enough to reset the pass on everything if the need should ever arise again (w" [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [23:59:15] (03CR) 10Dzahn: [C: 03+2] diamond: delete collector::servicestats* [puppet] - 10https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn)