[09:35:38] Domains, Traffic, Operations: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (akosiaris) p:Triage→Medium
[09:36:43] Traffic, DNS, Operations: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (akosiaris) p:Triage→Medium
[12:25:25] Traffic, Operations: Generate ATS cache.config from software-agnostic data structures - https://phabricator.wikimedia.org/T259692 (ema) Open→Resolved Done, `profile::trafficserver::backend::caching_rules` is now gone. `cache.config` is generated by parsing `req_handling` and `alternate_domains`....
[13:01:19] I'm currently puzzled by the kibana-next alert on icinga: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1025.eqiad.wmnet are marked down but pooled
[13:01:39] host is pooled indeed, https://config-master.wikimedia.org/pybal/eqiad/kibana-next
[13:01:55] and proxyfetch thinks it is up, yet pybal /alerts doesn't
[13:02:18] lvs1015:~# curl -s localhost:9090/metrics | grep -i status.*logstash1025
[13:02:21] pybal_monitor_status{host="logstash1025.eqiad.wmnet",monitor="ProxyFetch",service="kibana-next_443"} 1.0
[13:02:28] lvs1015:~# curl http://localhost:9090/alerts
[13:02:28] CRITICAL - kibana-next_443: Servers logstash1025.eqiad.wmnet are marked down but pooled
[13:10:17] mmh
[13:10:34] indeed according to pybal logs it's up
[13:10:35] Aug 07 10:44:23 lvs1015 pybal[11890]: [kibana-next_443] INFO: Server logstash1025.eqiad.wmnet (disabled/partially up/not pooled) is up
[13:11:23] well, 'partially up'
[13:12:05] true, also since then I've depooled/repooled and afaict now it should be fully up
[13:14:30] I'm tempted to restart pybal on lvs1016 and see if that "helps"
[13:17:10] godog: try first to disable/re-enable the host in etcd maybe
[13:17:59] I've done that already but no harm in trying again
[13:18:48] {{done}}, from the host via depool/pool scripts that is
[13:18:59] Aug 07 13:18:04 lvs1015 pybal[11890]: [kibana-next_443] INFO: Merged enabled server logstash1023.eqiad.wmnet, weight 10
[13:19:02] Aug 07 13:18:04 lvs1015 pybal[11890]: [kibana-next_443] INFO: Merged disabled server logstash1025.eqiad.wmnet, weight 10
[13:19:05] Aug 07 13:18:04 lvs1015 pybal[11890]: [kibana-next_443] INFO: Merged enabled server logstash1024.eqiad.wmnet, weight 10
[13:19:10] it is puzzling
[13:19:27] and then:
[13:19:32] Aug 07 13:18:27 lvs1015 pybal[11890]: [kibana-next_443] INFO: Merged enabled server logstash1025.eqiad.wmnet, weight 10
[13:20:19] now IdleConnection seems happy too
[13:20:29] pybal_monitor_status{host="logstash1025.eqiad.wmnet",monitor="IdleConnection",service="kibana-next_443"} 1.0
[13:21:06] indeed, and /alerts keeps reporting "down but pooled"
[13:23:11] I see 3 pooled servers according to the dashboard too: https://grafana.wikimedia.org/d/000000421/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1015&var-service=kibana-next_443
[13:23:56] and indeed the host is pooled in IPVS:
[13:24:00] TCP kibana-next.svc.eqiad.wmnet: sh -> logstash1023.eqiad.wmnet:htt Route 10 0 0 -> logstash1024.eqiad.wmnet:htt Route 10 0 0 -> logstash1025.eqiad.wmnet:htt Route 10 0 0
[13:24:04] bleah
[13:24:05] TCP kibana-next.svc.eqiad.wmnet: sh
[13:24:07] -> logstash1023.eqiad.wmnet:htt Route 10 0 0
[13:24:10] -> logstash1024.eqiad.wmnet:htt Route 10 0 0
[13:24:13] -> logstash1025.eqiad.wmnet:htt Route 10 0 0
[13:25:59] godog: +1 for restarting, it looks like we're dealing with an instrumentation.py bug to me
[13:26:49] ema: ack! same here, I'll restart lvs1016
[13:29:04] and indeed a restart "fixed" /alerts on lvs1016
[13:29:34] I'll bounce pybal on lvs1015 too
[13:29:55] 'hooray'
[15:05:30] Traffic, Operations, Phabricator, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (hashar)
[15:06:53] Traffic, Operations, Phabricator, serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (hashar) I filed a dupe of this task. The admin interface states the notification server is not reachable. At https...
[20:20:02] netops, Operations, ops-eqiad: new cloudflare xconnect to cr1-eqiad - https://phabricator.wikimedia.org/T259923 (RobH) p:Triage→Medium
[20:46:25] hi ema or vgutierrez ? thanks for the reviews and merge on the aphlict backend change! we were debugging why it does not fully work yet and eventually came up with this follow-up
[20:46:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/619036
[20:46:44] so i guess we need to set " caching: 'websockets'" too
[20:46:58] that is from comparing it to the Etherpad setup
[20:47:18] but we have both wss:// and https:// connections to the same host name
[20:47:41] extra: even though it's set to "normal" and not "pass" currently.. it seems like it actually does not cache anything for phab
[20:48:12] do not cache is what we want.. i just expected for that it would have to be "pass"
[20:51:38] the latter is because of the headers sent by phab
[22:02:39] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (Tgr) This also prevents me from making a card donation (via the donation link in the sidebar menu, but I imagine click...
[22:04:40] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (Tgr) Related: {T122097}
[22:26:17] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (Platonides) It could go both ways. If as an Hungarian with only Hungarian credit card, and temporarily visiting the US...
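
Note on the 13:02 pybal exchange: the manual check done there (comparing the `pybal_monitor_status` gauges from `/metrics` with the text returned by `/alerts`) could be scripted roughly as below. This is only a sketch, not anything from pybal itself. The sample strings are copied from the log; the function names are made up for illustration, the alert wording is inferred from the single example pasted above, and the handling of several hosts in one alert (comma or space separated) is an assumption.

```python
import re

# Sample inputs copied from the log; in practice they would come from
# http://localhost:9090/metrics and http://localhost:9090/alerts on the LVS host.
METRICS = '''\
pybal_monitor_status{host="logstash1025.eqiad.wmnet",monitor="ProxyFetch",service="kibana-next_443"} 1.0
pybal_monitor_status{host="logstash1025.eqiad.wmnet",monitor="IdleConnection",service="kibana-next_443"} 1.0
'''
ALERTS = "CRITICAL - kibana-next_443: Servers logstash1025.eqiad.wmnet are marked down but pooled"

# Label order (host, monitor, service) follows the lines pasted in the log.
METRIC_RE = re.compile(
    r'^pybal_monitor_status\{host="([^"]+)",monitor="([^"]+)",service="([^"]+)"\}\s+(\S+)$'
)
ALERT_RE = re.compile(r'(\S+): Servers (.+?) are marked down but pooled')


def monitors_up(metrics_text):
    """Map (service, host) -> {monitor: bool} from pybal_monitor_status lines."""
    status = {}
    for line in metrics_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            host, monitor, service, value = m.groups()
            status.setdefault((service, host), {})[monitor] = float(value) == 1.0
    return status


def alerted_hosts(alerts_text):
    """Return the set of (service, host) pairs named in 'down but pooled' alerts."""
    pairs = set()
    for service, hosts in ALERT_RE.findall(alerts_text):
        # Assumption: several hosts would be separated by commas and/or spaces.
        for host in re.split(r'[,\s]+', hosts.strip()):
            if host:
                pairs.add((service, host))
    return pairs


def inconsistent(metrics_text, alerts_text):
    """Hosts that /alerts calls down although every monitor gauge reads 1.0."""
    status = monitors_up(metrics_text)
    return sorted(
        pair for pair in alerted_hosts(alerts_text)
        if status.get(pair) and all(status[pair].values())
    )


print(inconsistent(METRICS, ALERTS))
```

With the strings from the log this flags logstash1025 for kibana-next_443, i.e. the same mismatch that was spotted by hand before the pybal restart.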