[05:31:27] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141772 (Marostegui) So this is almost confirmed related to atop. I killed it yesterday at around 14:30 and it was remained stopped till 00:00 (where it started automatic...
[05:35:32] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141775 (Marostegui) RX buffers reverted ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current...
[07:53:21] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4141874 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4006.ulsfo.wmnet ``` The log can be found in `/var/lo...
[08:20:17] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4141972 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4006.ulsfo.wmnet'] ``` and were **ALL** successful.
[08:47:38] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142017 (Vgutierrez)
[08:49:05] https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=3&fullscreen&orgId=1&from=1524127200000&to=1524127500000 --> lvs4006 coming back to life as stretch :)
[08:55:08] nice!
[08:55:58] elukey: I've amended https://gerrit.wikimedia.org/r/#/c/425550/ for the analytics_{a,b} split and merged it. Forced a puppet run on the kafka hosts as well as einsteinium, we're looking good
[09:00:50] ema: nice!
[09:00:53] thanks a lot
[09:26:31] Traffic, DC-Ops, Operations, monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4142112 (fgiunchedi) I've taken a first stab at reporting uncorrectable errors in https://gerrit.wikimedia.org/r/c/422110/ as reported by the kernel, so at least...
[09:34:32] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142137 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs4005.ulsfo.wmnet ``` The log can be found in `/var/lo...
[10:15:36] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142256 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs4005.ulsfo.wmnet'] ``` and were **ALL** successful.
[10:34:48] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4142329 (Vgutierrez)
[10:47:28] ulsfo done, let's see how it behaves :)
[11:14:29] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142413 (Marostegui) No more errors for the last 6 hours after killing atop. Also no drops or connections errors running the RX original buffers after reverting them as c...
[12:39:22] https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1 --> I just fixed the packets and bytes graphs to show proper units (packets/sec and bytes/sec)
[12:39:38] OCD happiness++
[12:40:26] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (BBlack) >>! In T191996#4139205, @Marostegui wrote: > For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: > ```...
[13:22:50] thank you!
[13:23:17] i can get really cranky when I see graphs without proper units ;)
[13:33:53] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142669 (Marostegui) >>! In T191996#4142547, @BBlack wrote: > > Not that it's probably the issue here, but this probably isn't ideal. If you look at `grep eno1 /proc/in...
[13:40:32] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142677 (Marostegui) Open>Resolved a:Marostegui So, as soon as I started atop, errors came back and packets dropped. So the culprit is clearly `atop`. I am go...
[13:43:12] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142681 (Marostegui)
[14:01:08] netops, DBA, Operations, ops-eqiad, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129002 (Marostegui)
[14:42:07] Traffic, Operations, Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4142839 (Vgutierrez) p:Triage>Normal
[15:08:47] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4142938 (JBennett) What's the data? From our clicktracking efforts what will we be collecting?
[15:10:30] Traffic, Operations, Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142948 (Vgutierrez)
[15:13:12] Traffic, Operations, Goal: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0 - https://phabricator.wikimedia.org/T192559#4142965 (Vgutierrez) p:Triage>Normal
[15:52:46] Traffic, Fundraising-Backlog, Operations, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4143132 (CCogdill_WMF) We're collecting click engagement off fundraising emails (actual fundraising appeals, or informational newsletter emails) that...
[18:17:15] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4143719 (ayounsi)
[18:18:33] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4062522 (ayounsi) Verified that external monitoring doesn't do ping checks (but http, etc. instead) to hostnames (en.wikipedia.org, etc). Added a Watchmouse ping check for...