[04:13:19] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 3 others: Reduce / remove the aggressive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) a: Smalyshev
[08:45:08] Traffic, Operations, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Ladsgroup) >>! In T99531#5136277, @Dzahn wrote: >> Also note since recently we now have wikibase.org (https://gerrit.wikimedia.org/r/c/operati...
[13:34:32] Traffic, MobileFrontend, Operations, TechCom-RFC, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Izno)
[14:10:11] Traffic, Operations, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (BBlack) Re: `wikibase.org`, adding it as a non-canonical redirection to catch confusion from those that manually type URLs is fine, but we sho...
[14:28:52] bblack: on the DNS issue with china
[14:29:00] what do we mean by 'polluted'?
[14:34:47] I think in that context the user just means "DNS interference by GFW"
[14:52:46] netops, Operations, Patch-For-Review, cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (Bstorm) Open→Resolved
[14:55:14] oh I see
[14:55:52] but our change was probably unrelated to en.w.o being inaccessible?
[15:00:38] it's complicated, but probably?
[15:00:57] most of the discussion on this topic is happening in non-transparent places, sorry
[15:01:09] meeting time
[15:09:35] Traffic, Operations, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (WMDE-leszek) thanks for the write up @BBlack, I am going to take over the domain ownership topic from WMDE side, as it apparently has fallen t...
[16:07:12] Traffic, Operations, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (BBlack) @WMDE-leszek Thanks for looking into it! I believe @CRoslof is who you want to coordinate with on our end, whose last statement on th...
[17:03:06] netops, Analytics-Kanban, EventBus, Operations: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (ayounsi) Open→Resolved I assumed you needed HTTPS and not HTTP based on T219552, but please reopen if it's wrong.
[17:07:33] netops, Analytics-Kanban, EventBus, Operations: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (Ottomata) HTTP is enough for now, thanks. If/when this gets exposed publicly we'll put it through the usual frontend nginx tls stuff there. Thank you!
[18:34:39] netops, Analytics-Kanban, EventBus, Operations: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (Ottomata) Hm, @ayounsi: `lang=shell [@stat1004:/home/otto] $ curl -Iv http://schema.svc.eqiad.wmnet:8190/repositories/ * Trying 10.2.2.43... [@stat...
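The reachability question in T221690 can be checked from an analytics host with a plain curl probe. Below is a minimal sketch, reusing the host, port 8190, and /repositories/ path that appear in the truncated paste above; the timeout and the pass/fail wrapper are illustrative additions, not part of the original check.

```shell
# minimal HTTP reachability probe for the endpoint shown in the paste above;
# --max-time keeps a filtered/blocked path from hanging the check
if curl -sI --max-time 5 http://schema.svc.eqiad.wmnet:8190/repositories/ >/dev/null; then
    echo "schema.svc reachable over HTTP"
else
    echo "schema.svc NOT reachable (timeout, refused, or filtered)"
fi
```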
[18:35:36] netops, Analytics-Kanban, EventBus, Operations: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (Ottomata) Ah I think that might have been my fault. T219552 doesn't specify a port; that task description was made before service was implemented. P...
[18:36:13] netops, Analytics-Kanban, EventBus, Operations: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (Ottomata) Resolved→Open
[19:17:52] Traffic, Operations, Wikidata, serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (Dzahn) >>! In T99531#5137543, @BBlack wrote: > Re: `wikibase.org`, adding it as a non-canonical redirection to catch confusion from those that...
[20:52:59] [21:49:27] hello, anyone know anything about wikipedia.org TLS impl? I do 2 GET requests, potentially in the same session via keep-alive, potentially not, but eligible for session re-use, and it looks like each request uses separate TLS keys. Full handshake for both requests rather than keep-alive or session ID reuse.
[20:55:28] uh also we got a (non-paging) alert for traffic drop and it does look real https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1556224608805&to=1556225676342
[20:56:33] https://grafana.wikimedia.org/d/000000341/dns?orgId=1&refresh=1m&from=1556224608805&to=1556225676342
[20:58:15] https://grafana.wikimedia.org/d/000000180/varnish-http-requests?from=1556224608805&to=1556225676342&orgId=1
[20:59:22] bblack: ^
[21:05:33] yeah, that's crazy
[21:05:48] I know arzhel just took off for an appointment so he won't be here to dig on network links
[21:06:03] but it certainly looks like it was pretty broad like a link/bgp kind of issue
[21:07:30] I'm rather curious to know how to dig into that stuff, if it doesn't involve too much arcane sorcery
[21:07:58] things that don't involve arcane sorcery are kinda boring :)
[21:08:05] * Platonides thinks that there is some level of arcane sorcery involved
[21:08:14] :)
[21:08:31] hey, hey, I said *too much* ;)
[21:21:45] eqiad->esams ssh latency went up around that time: https://grafana.wikimedia.org/d/000000387/network-probes?orgId=1&var-datasource=eqiad%20prometheus%2Fglobal&var-target=%5B2620:0:862:1:91:198:174:113%5D:22&from=1556223546301&to=1556226611848
[21:22:05] ping offload in eqiad went weird around then: https://grafana.wikimedia.org/d/000000513/ping-offload?refresh=30s&orgId=1&from=now-1h&to=now
[21:22:13] yeah based just on the initial links from cdanis, clearly a large amount of eqiad's routing suffered some big loss/latency event
[21:22:23] I'm starting an incident report
[21:22:29] the dns requests going to zero is telling, while other stuff didn't go to zero
[21:22:47] (dns recursors tend to be smart and they know they have three options, so they immediately begin avoiding the lossy/latent one)
[21:24:00] but then again, the other authdns didn't show a corresponding spike of requests either, so maybe that thinking is wrong
[21:25:02] probably unrelated then?
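One way to sanity-check the recursor-avoidance theory without relying on the graphs is to query each authoritative server directly and compare response times. A rough sketch, assuming ns0/ns1/ns2.wikimedia.org are the three authdns options being referred to:

```shell
# query each authoritative nameserver directly and print the query time it
# reports; the hostnames are an assumption about the three authdns options
for ns in ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org; do
    printf '%-22s ' "$ns"
    dig +tries=1 +time=2 @"$ns" en.wikipedia.org A | grep 'Query time' || echo 'no answer'
done
```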
<icinga-wm> PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:25:16] the authdns graphs are really weird actually, looking closer
[21:25:52] almost more like it was a dropout of prometheus data availability rather than the requests themselves
[21:26:16] yeah, I've gotten some inconsistent results
[21:26:17] (at least, in the middle. there's some actual recorded ramp down before it, and a ramp back out afterwards)
[21:26:28] but I don't think the whole event is artificial
[21:26:42] just maybe we also have strange artificial things going on with our graphs as well
[21:27:13] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=prometheus1003&var-datasource=eqiad%20prometheus%2Fops&var-cluster=prometheus&from=1556224075036&to=1556226210110
[21:27:20] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=1556224075036&to=1556226210110&var-server=prometheus1004&var-datasource=eqiad%20prometheus%2Fops&var-cluster=prometheus
[21:27:42] this happened on both prometheus servers -- way more than usual CPU load, lots of allocations, I'm guessing eventual OOM event
[21:28:09] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1556223851824&to=1556226195354&var-site=eqiad&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 shows a gap
[21:28:19] hmmmm
[21:28:42] ok, so maybe a better question to ask then, is there any evidence of problems where the evidence is independent of prometheus?
[21:29:09] [3145345.030142] Out of memory: Kill process 15366 (prometheus) score 936 or sacrifice child
[21:29:11] [3145345.039492] Killed process 15366 (prometheus) total-vm:629872744kB, anon-rss:92657964kB, file-rss:0kB, shmem-rss:0kB
[21:29:13] from prometheus1003
[21:29:35] e.g. if it were a real problem with routing, we should see effects in librenms traffic graphs of transit/peering too
[21:29:42] that would be independent.
[21:29:46] or catchpoint maybe
[21:29:54] no more catchpoint ;)
[21:29:57] heh
[21:30:04] however, we should have seen RIPE Atlas alerts
[21:30:06] we still have some statsd-based things
[21:30:13] (for traffic levels, I think)
[21:30:35] okay
[21:30:37] FWIW
[21:31:00] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&from=now-6h&to=now
[21:31:14] ^ this isn't from prometheus, and shows no dropoff in HTTP request-rate hitting eqiad front edges
[21:31:33] (statsd-based)
[21:31:34] yeah, this does look like a prometheus event
[21:32:42] ok
[21:33:27] timeline doesn't quite make sense to me -- what I see is prometheus being oom-killed on prometheus1004 at 20:41:54, and 20:41:42 on prometheus1003
[21:33:38] this event seems like a good reminder not to have monitoring-spofs (have at least two independent ways to verify basic outage data)
[21:33:45] I also don't understand why it happened on both hosts in a synchronized fashion
[21:34:13] likely induced by a huge influx of data from some source? and likely prometheus was already falling behind or losing data before oomkill
[21:34:31] that's not really how prometheus ingests data, though
[21:35:01] well, by that you mean prometheus itself does the fetching, right?
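If the working theory is that a scrape suddenly returned far more data than usual, the per-target scrape_samples_scraped series that Prometheus records about its own scrapes is one place to look. A rough sketch against the standard Prometheus HTTP API; the host, port, and path below are placeholders, not the actual layout of the prometheus1003/1004 instances:

```shell
# rank scrape targets by the number of samples returned on their last scrape;
# the API URL is a placeholder -- point it at the instance actually under test
PROM_API='http://localhost:9090/api/v1/query'
curl -sG "$PROM_API" \
     --data-urlencode 'query=topk(10, scrape_samples_scraped)' | jq '.data.result'
```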
[21:35:06] yeah
[21:35:10] but I mean, maybe the fetching pulled in way more MBs of data than usual
[21:35:29] something that sends true events rather than a value-per-time-interval
[21:35:57] anyways I've gotta run now too, but I'm assuming so far that real things are sane/stable
[21:36:02] they do seem to be
[21:36:16] I'll spend a bit of time digging into what happened wrt prom
[23:18:23] I was at the dentist so I'm surprised it was not a network outage :)
[23:19:16] cdanis: speaking of external monitoring, this could be interesting to implement too: https://labs.ripe.net/Members/daniel_czerwonk/using-ripe-atlas-measurement-results-in-prometheus-with-atlas_exporter
[23:19:57] ooh thanks XioNoX
[23:20:24] we can see network events, like sudden changes of network hops, etc...
[23:20:32] and latency obviously
[23:21:04] and maybe replace the ripe atlas icinga alerts with something smarter using that data
[23:42:04] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus
[23:54:29] Traffic, Operations, decommission, ops-codfw: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286 (RobH)
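Even before wiring up atlas_exporter, the same measurement data can be pulled ad hoc from the RIPE Atlas v2 API as a Prometheus-independent check. A minimal sketch, assuming the v2 "latest" results endpoint; the measurement ID is a placeholder, and the prb_id/avg fields assume a ping-type measurement:

```shell
# fetch the latest result per probe for one Atlas measurement and sort by RTT;
# MSM_ID is a placeholder -- substitute a real measurement ID
MSM_ID=12345678
curl -s "https://atlas.ripe.net/api/v2/measurements/${MSM_ID}/latest/" \
  | jq '[.[] | {probe: .prb_id, avg_rtt_ms: .avg}] | sort_by(.avg_rtt_ms)'
```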