[06:00:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:05:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:19:56] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:24:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:21:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 43.66545296106766% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[07:21:56] (HAProxyEdgeTrafficDrop) firing: 40% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:26:16] (VarnishTrafficDrop) resolved: (7) Varnish traffic in drmrs has dropped 54.87109861457756% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[07:26:56] (HAProxyEdgeTrafficDrop) resolved: (5) 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:47:23] greetings, following up on T320627 from yesterday, I'll proceed with a rolling restart of pybal in ulsfo, sounds good ?
[07:47:24] T320627: Alert on individual pybal backend hosts being down for a long time - https://phabricator.wikimedia.org/T320627
[09:20:05] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10ayounsi) Access security zone, DHCP server, NAT config removed from the routers. New DHCP relay feature enabled instead of the old bootp one. Netbo...
[09:24:36] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) I'm trying to capture this project also in https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProp...
[09:27:23] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10taavi) >>! In T314847#8326550, @cmooney wrote: >>> /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the s...
[09:38:19] vgutierrez or brett maybe ^ ? re: roll restart of pybal to pick up latest changes
[09:41:10] :?
[09:41:30] godog: why do you need a rolling restart?
[09:41:38] to clean cp4021 and cp4027?
[09:42:06] vgutierrez: yeah that's right, I'll start with ulsfo and then do other sites too for the same reason
[09:42:30] hmmm that would be a problem
[09:42:54] so to keep this new check happy we are requiring a pybal restart every time that a host is decomm'ed?
[09:45:05] mmhh I see what you mean, I thought it was more of a bug that pybal doesn't clean up its monitor(s) when a host disappears from etcd
[09:45:39] pybal cleans its monitors
[09:45:43] and stops monitoring the host
[09:46:03] what it doesn't clean for some reason is the prometheus metric
[09:46:31] ah got it, ok!
[09:47:36] sigh ok I'll dig into it a little bit
[09:47:59] but yeah, in theory there shouldn't be a forced pybal restart on decom
[10:00:59] ok that makes sense, metrics are not cleaned up on stop() in the monitor
[10:05:03] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) >>! In T314847#8328277, @taavi wrote: > Your comment was written in a way that made me understand that everything used in cod...
[10:05:24] 10netops, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) >>! In T314847#8328277, @taavi wrote: >>>! In T314847#8328272, @aborrero wrote: >> HAproxy uses LVS/ipvsadm for them under t...
[10:50:41] i imagine that wouldn't be hard to fix
[11:48:25] yeah doesn't seem too hard so far, I hope the thread I'm pulling isn't very long / has many things attached :)
[12:06:00] oh you'll be the de facto pybal maintainer before you know it
[12:10:57] haha! funny because it is true
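For context, the stop()-time cleanup being discussed above could look roughly like the following Python sketch. It uses prometheus_client with a made-up metric name and a toy monitor class; pybal's real monitor and metrics code is structured differently, and the actual fix is the Gerrit change linked later in the log.

    # Hedged sketch only: shows removing a labelled Prometheus series when a
    # monitor stops, so the series for a decommissioned host doesn't linger.
    from prometheus_client import Gauge

    # Hypothetical per-backend gauge; metric name and labels are assumptions.
    MONITOR_UP = Gauge(
        "pybal_monitor_up_sketch",
        "Health of a monitored backend (illustration only)",
        ["service", "host", "monitor"],
    )

    class HostMonitor:
        """Toy monitor: tracks one backend and cleans up its series on stop()."""

        def __init__(self, service, host, name="idleconnection"):
            self.label_values = (service, host, name)
            MONITOR_UP.labels(*self.label_values).set(0)

        def report(self, healthy):
            MONITOR_UP.labels(*self.label_values).set(1 if healthy else 0)

        def stop(self):
            # Without this, the series for a host removed from etcd sticks
            # around (and keeps checks unhappy) until pybal is restarted.
            MONITOR_UP.remove(*self.label_values)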
[12:10:59] https://gfycat.com/equatorialpleasedaustralianshelduck
[13:06:06] there https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/844469
[13:06:18] it wasn't too bad indeed
[13:32:51] 10netops, 10Infrastructure-Foundations: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi) p:05Triage→03Medium
[14:03:22] 10netops, 10Infrastructure-Foundations, 10SRE: Ramp up SV1 IXP - https://phabricator.wikimedia.org/T321193 (10ayounsi)
[14:37:56] (HAProxyEdgeTrafficDrop) firing: 32% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:38:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 66.55825287357753% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[14:42:56] (HAProxyEdgeTrafficDrop) resolved: (2) 47% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:43:16] (VarnishTrafficDrop) resolved: Varnish traffic in eqiad has dropped 63.276856395744524% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:26:56] (HAProxyEdgeTrafficDrop) firing: 31% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:27:16] (VarnishTrafficDrop) firing: (4) Varnish traffic in drmrs has dropped 56.68580317020353% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:31:56] (HAProxyEdgeTrafficDrop) resolved: (4) 59% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:32:16] (VarnishTrafficDrop) resolved: (8) Varnish traffic in drmrs has dropped 51.01680473526143% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[18:58:43] bblack: Would it be a good time to try again to remove the LVS service?
[19:44:05] mutante: maybe!
[19:51:27] bblack: heh.. eh.. yea, I have about 3 changes: one removes it from conftool-data, one removes it from common/service.yaml (I am not sure yet if they should be combined, but it doesn't matter if I merge them close together) and then finally one to remove it from DNS
[19:52:12] the removal from conftool-data last time meant it took a couple of minutes and then there were alerts
[19:53:13] and I did follow the steps "confctl decommission", "then remove from conftool-data"
[19:53:31] can't follow the exact docs because it doesn't have a discovery record
[20:00:49] mutante: ok cool, can you link me the changes and I'll have a look before we try?
[20:03:55] bblack: conftool-data: https://gerrit.wikimedia.org/r/c/operations/puppet/+/844041 service.yaml: https://gerrit.wikimedia.org/r/c/operations/puppet/+/843522 DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/831627
[20:04:40] docs 1: https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service (this is where it says to remove discovery DNS which we don't have)
[20:05:18] also that is where the ipvsadm commands and pybal restarts are, which are scary to me
[20:05:45] docs 2: https://wikitech.wikimedia.org/wiki/Conftool#Decommission_a_server
[20:05:57] this only talks about decom'ing a single server, but not a whole service
[20:06:46] but it's what made me run "confctl decom.." followed by removing from conftool-data
[20:08:07] yeah
[20:08:30] so basically, what I expect we need to do (but reality could prove me wrong!) is:
[20:09:15] 1. merge your first two patches (the DNS one can wait till after everything else), and puppet-merge it
[20:10:17] 2. affected pybal hosts (2 per core dc): run puppet agent then restart pybal, safely
[20:10:43] 3. puppetmasters: cleanup conftool stuff (probably there will be some .err file leftover somewhere)
[20:10:57] that's what I expect, roughly, anyways. we'll have to see what snags we find
[20:11:15] [oh and the ipvsadm cleanup on the lvs hosts too, after the pybal restart]
[20:11:44] ok!
[20:11:58] I think we can expect 12 alerts :)
[20:12:20] 4 x .err files, 4 x pybal ...etc
[20:12:25] yeah something like that
[20:12:46] should we try to downtime any of that in Icinga first?
[20:12:57] or keep it so we can watch it
[20:12:58] nah just keep it
[20:13:20] ok. well.. then, I will start
[20:13:28] ack
[20:15:07] let me know after the puppet-merge is done
[20:15:45] puppet merge is done on master, both patches
[20:16:34] ok, any obvious errors out of it when it did conftool changes?
[20:16:37] affected hosts should be: 1008, 1010, 2008, 2010 afair
[20:16:58] 1020 + 1017 in eqiad, 2010 and 2008 in codfw
[20:17:10] err, sorry
[20:17:17] 1020 + 1018 in eqiad, 2010 and 2008 in codfw
[20:17:26] should I run the "confctl decommission.. " command on both backend hosts?
[20:17:34] I don't know what that does, so let's not
[20:17:45] it sets pooled=inactive
[20:17:59] we don't need anything set to pooled=anything. we want it all to cease existing
[20:18:06] ack
[20:18:33] it looks gone in etcd right now
[20:18:47] looking with: confctl select service=git-ssh get
[20:19:12] puppet removed it from pybal.conf on lvs1020
[20:19:25] right
[20:19:27] running puppet on conf1007
[20:19:38] go ahead and kick off agent runs on all the affected lvses (it won't restart pybal, just amend the configs)
[20:20:05] ok
[20:21:04] the part we have to step through carefully, in the correct order, is restarting pybals + manual ipvsadm removal
[20:21:07] meanwhile on conf1007: removed from etc/nginx/sites-available/etcd-tls-prox
[20:21:21] - location /v2/keys/conftool/v1/pools/eqiad/phabricator {
[20:21:23] etc
[20:21:28] ok
[20:21:52] puppet run on the 4 LVS hosts
[20:21:57] ok
[20:21:57] ran
[20:22:15] so for the tricky-ish part, let's start with lvs2010 (backup lvs in codfw)
[20:23:14] basically the process there is: systemctl restart pybal.service, then the ipvsadm command which is specific to the two (v4 + v6) listen addresses in each DC
[20:23:46] ipvsadm -Dt '208.80.153.250:22'
[20:24:02] ipvsadm -Dt '[2620:0:860:ed1a::3:fa]:22'
[20:24:09] D for delete, t for tcp
[20:24:18] ack
[20:24:20] this takes the entry out of the live LVS config
[20:25:06] double checked IP. doing that
[20:25:43] restart pybal - done
[20:26:09] ipvsadm commands - done
[20:26:13] I saw the ipvs entries disappear
[20:26:20] the whole list can be seen with 'ipvsadm -Ln'
[20:26:30] or I was confirming these by watching the output of 'ipvsadm -Ln|grep :22'
[20:26:53] I see. yes, it looks gone
[20:27:00] so now the same process on lvs2008
[20:27:12] restart pybal, then the same two IP:port remove commands after
[20:27:56] we are getting some alerts now in -operations
[20:28:01] ok, lvs2008
[20:28:48] lvs2008 - done
[20:29:14] it looks gone from the list
[20:29:17] ConfdResourceFailed are the ones persisting I think, we can get to those after the lvs work
[20:29:35] so yeah, now loop through the same thing on lvs1020 then lvs1018, using the IPs for that site
[20:29:55] 208.80.154.250 + 2620:0:861:ed1a::3:16
[20:31:25] ACK, IP confirmed. lvs1020 incoming
[20:33:24] lvs1020 - done
[20:34:50] ready for lvs1018?
[20:34:52] yup!
[20:35:03] ipvsadm -Ln does have some unrelated output here
[20:35:08] that it did not have in codfw
[20:35:11] when doing | grep :22
[20:35:21] yeah those are just mismatches from a not-very-specific grep
[20:35:24] but that's just cp1079
[20:35:32] ack, moving on
[20:35:34] (the IPs happen to have the string ':22' in them, but not the port)
[20:35:43] yep
[20:36:21] done!
[20:36:43] ok now the toml alerts, which I'm pretty sure are just .err files from temporary conditions on the puppetmasters
[20:36:48] deleting .err files on puppetmaster1001
[20:36:50] right?
[20:36:55] I see them
[20:37:23] deleted on puppetmaster1001
[20:37:27] yeah: rm -f /var/run/confd-template/.git-ssh*.err
[20:37:34] (on both puppetmasters)
[20:38:31] done, on both
[20:38:56] checking Icinga web UI
[20:39:35] I think icinga already lost/unconfigured the alerts anyways
[20:39:50] the ConfdResourceFailed one is directly in alertmanager
[20:40:07] wow, nothing there
[20:40:15] it went away now, the alertmanager ones, in the UI
[20:40:17] right, that's the jinxer-wm part
[20:40:21] but it never reported resolution to IRC
[20:40:26] ah, ok
[20:41:03] I'm guessing there's some model problems there. Anything that fires an alert to IRC should also eventually fire an IRC resolution message when it clears, IMHO
[20:41:11] but we have plenty of alerts that behave bad like that :)
[20:41:20] unrelated alerts - pybal backends down in ulsfo
[20:41:37] when looking at alertmanager web
[20:41:48] I agree about the recoveries
[20:42:28] hmmm yeah, those ulsfo ones seem to be incomplete decoms, will get with sukhe and resolve, thanks!
[20:42:34] all clear on the git-ssh removal at this point?
[20:42:47] I think so :)
[20:42:51] thank you very much
[20:42:55] oh look, jinxer did eventually tell IRC
[20:43:15] ACK, nice :)
[20:44:16] my ticket did not get all my logs.. but other than that, yay
[20:45:09] I missed the !log keyword, duh :) fixing that, then done
[20:46:09] 10Traffic, 10Performance-Team, 10SRE, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10BCornwall)
[20:47:32] bblack: wait, you said lvs1017 before we started but then we did not do 1017, but 1018
[20:48:04] right, 1017 was a typo
[20:48:10] ok, good :)
[20:52:05] 10Traffic, 10Performance-Team, 10SRE, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10BCornwall)
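To recap the manual per-LVS-host cleanup performed above for the git-ssh removal, here is a minimal Python sketch. The VIPs, the commands, and the backup-before-primary ordering come straight from the log; the script itself is illustrative only, not an existing cookbook, and assumes it is run as root on one LVS host at a time after the conftool-data and service.yaml patches have been merged and puppet has regenerated pybal.conf.

    #!/usr/bin/env python3
    # Illustrative sketch, not an existing tool: the per-LVS-host steps done
    # by hand in the log above (backup LVS first, then the primary).
    import subprocess

    # git-ssh VIPs removed in the log above, per core site.
    VIPS = {
        "codfw": ["208.80.153.250:22", "[2620:0:860:ed1a::3:fa]:22"],
        "eqiad": ["208.80.154.250:22", "[2620:0:861:ed1a::3:16]:22"],
    }

    def cleanup(site):
        # Restart pybal so it reloads the regenerated pybal.conf
        # without the removed service.
        subprocess.run(["systemctl", "restart", "pybal.service"], check=True)
        # Delete the now-orphaned IPVS virtual services (-D delete, -t TCP).
        for vip in VIPS[site]:
            subprocess.run(["ipvsadm", "-Dt", vip], check=True)
        # Print the remaining table to confirm the entries are gone.
        subprocess.run(["ipvsadm", "-Ln"], check=True)

    if __name__ == "__main__":
        cleanup("codfw")  # e.g. lvs2010/lvs2008; use "eqiad" on lvs1020/lvs1018

Afterwards, as in the log, any leftover /var/run/confd-template/.git-ssh*.err files on the puppetmasters are removed by hand so the ConfdResourceFailed alerts clear.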