[09:09:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:09:25] hmmm that alarm is missing a dashboard link
[09:14:00] RESOLVED: [3x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:18:34] and it looks like eqiad<-->eqsin transient communication issues
[09:34:00] FIRING: [2x] PurgedHighEventLag: High event process lag with purged on cp5020:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:39:00] FIRING: [19x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:44:00] RESOLVED: [24x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:47:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp5031:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5031 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:52:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp5024:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:00:28] netops, Traffic, Infrastructure-Foundations, serviceops: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#10076853 (Vgutierrez) A quick test using IPVS maglev implementation with mh-port flag enabled (to include the source port as part of the load bal...
[10:29:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp5023:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5023 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:34:00] FIRING: [9x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:37:24] XioNoX, topranks any ongoing maintenance impacting eqiad<->eqsin communications? https://grafana.wikimedia.org/goto/hPzkhlCSg?orgId=1
[10:39:00] FIRING: [10x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:41:44] vgutierrez: looking
[10:43:15] purged alerts started soon after the base latency increase at ~9am
[10:44:00] FIRING: [7x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:44:12] usually with this it's some routing change over Arelion's MPLS core
[10:44:31] latency across that link is definitely higher than the baseline before, but still lower than what that graph is showing
[10:44:34] https://www.irccloud.com/pastebin/qPO2wtpD/
[10:47:07] vgutierrez: it's not a routing change, link is saturating :(
[10:47:07] https://grafana-rw.wikimedia.org/d/f61a7d56-e132-44dc-b9da-d722b11566cf/network-totals-by-site?orgId=1&refresh=30s&var-site=eqsin%20prometheus%2Fops
[10:47:25] I'll shift some traffic onto the GTT path
[10:49:00] RESOLVED: [11x] PurgedHighEventLag: High event process lag with purged on cp5018:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:52:56] my bad - there is no GTT in Singapore :(
[10:53:48] only alternate path is via ulsfo over NTT, but not much BW on that either :(
[10:54:26] vgutierrez: I'll see if I can balance some traffic that way, but either way we need to work out what's causing that usage
[10:58:22] Coming from AWS
[10:58:31] AS16509
[11:02:35] UA seems to be "pyWikiCommons/0.0.5 (https://github.com/amckenna41/pyWikiCommons; root)"
[11:22:39] netops, Infrastructure-Foundations, serviceops, SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 (Clement_Goubert) NEW
[11:25:27] netops, Infrastructure-Foundations, serviceops, SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077064 (Clement_Goubert) p:Triage→High
[11:26:00] FIRING: [8x] PurgedHighEventLag: High event process lag with purged on cp5019:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:31:00] RESOLVED: [28x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:41:52] netops, Infrastructure-Foundations, serviceops, SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077137 (Clement_Goubert) From what I can gather the automation is there with the `--move-vlan` option to the reimage cookbook, I th...
[12:20:11] netops, Infrastructure-Foundations, serviceops, SRE: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077212 (ayounsi) > I need to check that the physical cabling changes are ok before we start Physical cabling is on the new switches...
[14:31:24] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077810 (VRiley-WMF) @ayounsi I've checked the device and there doesn't seem to be any failure notifications (Physically anyway). Would it be possible to open up a RMA or Su...
[14:43:25] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077862 (cmooney) >>! In T372781#10077810, @VRiley-WMF wrote: > @ayounsi I've checked the device and there doesn't seem to be any failure notifications (Physically anyway)....
[14:44:50] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077866 (VRiley-WMF) Sounds like a plan. Thank you! I will be at the ready.
[15:02:33] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10077948 (cmooney) Open→Resolved >>! In T370862#10035781, @Papaul wrote: > @cmooney links removed. Y...
[15:14:39] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078004 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2291 to...
[15:15:17] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078010 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[15:17:14] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078014 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:28:38] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[15:57:27] netops, Infrastructure-Foundations, Patch-For-Review: Netbox ProvisionServer script fails vlan verification - https://phabricator.wikimedia.org/T372654#10078233 (cmooney) The above patch will prevent this causing an issue when we follow the normal workflow - selecting a vlan 'type' (public/privat...
[16:00:23] netops, Infrastructure-Foundations, Patch-For-Review: Netbox ProvisionServer script fails vlan verification - https://phabricator.wikimedia.org/T372654#10078237 (cmooney) Open→Resolved a:cmooney
[16:16:06] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078336 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[16:25:29] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078401 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=24f68f00-c864-474e-a3e6-c044aab86afa) set by...
[16:25:41] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078403 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=612388e5-b8df-408f-81be-6f237cee6e7c) set by...
[16:50:17] Wikimedia-Apache-configuration, collaboration-services, Phabricator, Release-Engineering-Team (Priority Backlog 📥), and 3 others: Apache 2.4.61 throws a 403 Forbidden for links containing %3F - https://phabricator.wikimedia.org/T370110#10078534 (brennen) Open→Resolved a:brennen
[16:56:22] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2014.co...
[18:04:03] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078855 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2014.codfw....
[18:42:39] netops, Infrastructure-Foundations, SRE: PuppetDB import failing for lvs2014 - https://phabricator.wikimedia.org/T372931 (cmooney) NEW
[18:46:04] netops, Infrastructure-Foundations, SRE: PuppetDB import failing for lvs2014 - https://phabricator.wikimedia.org/T372931#10078999 (ssingh) As another data point, we most certainly have not reimaged any LVS host //after// the Netbox migration was finished. So yeah, it might be related to that.