[00:54:20] 10netops, 10Operations: asw1-eqsin vcp port flapping - https://phabricator.wikimedia.org/T192125#4128813 (10ayounsi) p:05Triage>03High
[01:20:45] 10netops, 10Operations, 10Patch-For-Review: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4128840 (10ayounsi) From JTAC, the nonstop-routing issue most likely have been caused by a Junos bug where the following commit sometimes enables nonstop-routing before disabling graceful-restart, while t...
[07:51:14] 10Traffic, 10DC-Ops, 10Operations, 10ops-codfw: lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082#4129162 (10ema) p:05Triage>03Normal
[07:51:38] 10Traffic, 10netops, 10Operations, 10Pybal: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4129163 (10ema)
[08:06:17] https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS --> it would be awesome to automate the pooling/depooling described here with a small BGP daemon, something like gobgpd O:)
[08:25:40] vgutierrez: (asking because of ignorance about BGP) this would mean having a daemon running somewhere (announcing routes for ns IPs) rather than having static routes on cr1/cr2?
[08:25:44] we'll get somewhere even better than that eventually
[08:26:14] but baby steps for now as we have time! :)
[08:26:21] elukey: that's what I mentioned gobgpd :)
[08:26:26] s/what/why/g
[08:26:29] bblack!!!
[08:26:32] go back to sleep!
[08:26:49] hmmm back.. or already...
[08:27:24] oh I'm up for the day at this point, with a backlog of things I want to get done before our meeting later
[08:27:34] :O
[08:28:30] vgutierrez: yep yep I was trying to get the picture :)
[08:28:32] anyways, the long-term vision is something like this: we have a pair of dnsN00x at each site, and they all run our authdns + recdns daemons, and they all advertise BGP to routers for service, and the IPs for both auth- and rec- dns are anycasted ones.
[08:28:54] there's some nits to sort out on the fine details and reliability/monitoring/failure-mode stuff, but we'll get there
[08:29:02] arzhel already did a lot of it for the recdns case
[08:29:23] but basically I don't want LVS in the picture at all
[08:29:28] (for the dns services)
[08:30:57] the loopy layering of recdns+lvs today is troublesome. everything needs recdns, but recdns needs lvs which needs recdns...
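As a rough illustration of the gobgpd idea above: each dns host could peer with its local router and announce or withdraw the service prefix to pool/depool itself. This is only a sketch; the ASNs, addresses and prefix below are placeholders, not the real values.
```
# gobgpd.conf -- minimal sketch with placeholder ASNs/addresses
[global.config]
  as = 64512
  router-id = "198.51.100.10"

[[neighbors]]
  [neighbors.config]
    neighbor-address = "198.51.100.1"   # the local cr1/cr2 router
    peer-as = 64513
```
Pooling/depooling would then amount to something like `gobgp global rib add 198.51.100.53/32` / `gobgp global rib del 198.51.100.53/32` (ideally driven by a health check of the local DNS daemon), rather than static routes on the routers.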
[08:31:22] and recdns is mostly used for internal lookups, so it all depends on our authdns working right too
[08:31:43] having it all on the same hosts means the common lookup of our own data from recdns->authdns is over the loopback
[08:32:16] and having them both advertise as anycast means failover both within a DC and between DCs for those critical services is simpler (clients can be unaware and keep hitting the same service IP)
[08:32:45] yey, basically our own "8.8.8.8"
[08:33:11] yeah something like that, but obviously on a private-space anycast for our recdns
[08:33:23] and then a separate pair of public anycasts for our authdns as well
[08:33:50] (which should help with dns latency from public caches -> our authdns, too, in the average)
[08:35:54] yup, BGP magic :D
[08:36:57] I'm about to depool nescio.w.o BTW
[08:39:33] bblack: hey :)
[08:40:12] I've checked the geomapping of Opera Mini's XCIP, they're all correct (eg: singapore -> singapore, amsterdam -> amsterdam, US -> US)
[08:42:43] vgutierrez: you can use this to double-check if things work fine: https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1
[08:46:00] hmmm that dashboard needs some love... dns5* is missing
[08:46:11] ah nice catch
[08:46:23] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Gilles) Do you want to rephrase this task's description to be about the incident? As a "let's investigate what happened"...
[08:47:32] and it looks like that esams DNS rec dns servers are not being used
[08:47:45] I highly suspect those 14 requests are generated by pybal monitors
[08:48:35] that, or the dashboard is utterly wrong
[09:02:11] they are being used, see `sudo tcpdump port 53 and dst host dns-rec-lb.esams.wikimedia.org` on maerlant
[09:08:45] I just added eqsin to https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1
[09:10:05] vgutierrez: OK, I'd work on the dashboard to add the prometheus datasource and server name as template variables if you agree
[09:10:41] it seems more useful to be able to choose the DC/server instead of seeing the whole list?
[09:11:13] yup
[09:11:22] alright, I'll make it so
[09:12:19] hmmm something is wrong with the dashboard...
[09:12:50] or not...
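A sketch of the template-variable idea for the dns-recursors dashboard: $datasource would select the per-site Prometheus instance and $server would filter by host, so a single panel query can be reused per DC/host, roughly along these lines. The metric name is a placeholder, not necessarily what the recursor exporter actually exposes.
```
rate(pdns_recursor_questions{instance=~"$server.*"}[5m])
```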
[09:19:22] it's all right, https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1&panelId=13&fullscreen
[09:19:49] if you pick only maerlant and nescio you can clearly see how nescio has been depooled and maerlant got all the load
[09:20:20] but as ema said, seeing esams besides eqiad or codfw is not really useful
[09:29:12] 10Traffic, 10Operations: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129273 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['nescio.wikimedia.org'] ``` Of which those **FAILED**: ``` ['nescio.wikimedia.org'] ```
[09:29:32] *sigh* :)
[09:35:14] done, you can now choose the DC/host on https://grafana.wikimedia.org/dashboard/db/dns-recursors
[09:35:47] lovely :D
[09:36:28] for some reason reimage-host failed to handle the new puppet certs for nescio.w.o
[09:37:03] but it's already fixed & running puppet for the first time
[09:38:51] mmh yeah, the logs are not particularly informative
[09:39:25] 2018-04-13 09:29:07 [INFO] (vgutierrez) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: Failed to puppet_generate_certs
[09:39:28] 2018-04-13 09:29:07 [ERROR] (vgutierrez) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
[09:39:33] indeed
[09:40:10] it looked like it wasn't able to run puppet agent to get the new cert generated
[09:40:30] cause when I ran it manually, a new cert got generated
[09:41:26] (one day I'll understand why my tmux c/p always add unpleasant newlines)
[09:41:36] hmmm something is not completely ok wih nescio
[09:42:01] apt (done by puppet) is going really slow
[09:45:14] so I'm looking at puppetboard: https://puppetboard.wikimedia.org/report/nescio.wikimedia.org/abf30d3768ab20ee656b452ba0ce707da628392c
[09:45:39] right, that's the last puppet run before reimaing
[09:45:42] *reimaging
[09:45:55] it looks like the timezone is not set to UTC, it's now 09:45 UTC and that puppet run says 10:38
[09:46:00] take into account that info is uploaded to puppetdb after puppet ends running
[09:46:24] uh, you're right
[09:46:27] what's the time on nescio? :)
[09:48:03] +++ /tmp/puppet-file20180413-600-1als4e5 2018-04-13 09:46:49.486045576 +0000
[09:48:07] hmm right now is already in UTC
[09:48:20] but that puppet run was done with jessie
[09:48:52] arg
[09:49:04] ema, puppetboard is a smart ass and it's changing the TZ for you
[09:49:32] I saw it loading https://puppetboard.wikimedia.org/nodes
[09:49:53] you see how it show raw dates with +0000 (UTC TZ), and then it changes the hours to match the browser TZ
[09:50:05] oh man
[09:50:15] take for instance https://puppetboard.wikimedia.org/node/lvs5003.eqsin.wmnet
[09:50:24] it's showing 11:40 here
[09:50:34] and it really means 11:40 CEST/UTC+2
[09:50:55] maybe we can ask to volans|off to change this
[09:51:04] ugh, I hate things that ever track or display non-UTC times (well, in our technical world anyways)
[09:51:29] I keep my IRC bouncer and my laptop's clock on UTC as well just to cut confusion heh
[09:51:54] https://github.com/voxpupuli/puppet-puppetboard/pull/30
[09:52:06] it looks like disabling localise_timestamp should be enough
[09:52:27] that seems like a good idea, it's gonna be fun to look at puppetboard next time I'll come to the US otherwise
[09:52:33] hahahah
[09:53:24] let's take a look into puppetboard puppetization
[09:53:41] what's nescio up to in the meantime? Still apt-getting stuff?
[09:53:57] this is also why we should all move to Iceland, to avoid ever seeing non-UTC clocks
[09:54:20] ema: nope.. deploying $HOME files
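For reference, the change being discussed maps to a single puppetboard option. A minimal sketch, assuming the stock settings-file mechanism; the actual puppetization may template this file differently:
```
# puppetboard settings override (sketch)
LOCALISE_TIMESTAMP = False   # show timestamps as stored (UTC) instead of converting to browser-local time
```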
[09:54:35] what's up?
[09:55:07] I've a CR for you :)
[09:56:08] yeah the time can be changed is just config
[09:56:12] what about the reimage issue?
[09:57:02] bblack: Canary Islands might be a better choice weather-wise, I'm willing to accept the UTC+1 confusion during summer
[09:58:34] vgutierrez: do you agree that the readme is confusing? LOCALISE_TIMESTAMP: Normalize time based on localserver time.
[09:58:51] volans|off: oh god, it sucks
[09:59:47] LMK when is merged that I can run puppet and restart uwsgi on both hosts
[10:00:05] ack
[10:00:53] volans|off: done
[10:01:02] ack
[10:01:37] rebooting nescio after puppet run....
[10:03:36] vgutierrez: :(
[10:03:47] volans|off: what happens?
[10:04:09] 2 problems, 1) the showed time is ugly 2) it's ugly because the JS call to the localise_timestamp fails... I'll debug it
[10:04:33] yes.. it's pretty ugly indeed
[10:04:50] but because JS fails and an xhr request too, looking
[10:06:54] ok the xhr is unrelated, looking at the JS :(
[10:07:21] vgutierrez: cool, nescio is back online and looks fine
[10:07:35] https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1&var-datasource=esams%20prometheus%2Fops&var-server=All
[10:08:08] yep
[10:08:14] I've sent a couple of requests myself, hence the 'questions' graph going up despite nescio being depooled
[10:08:18] me too
[10:08:26] I was testing the DNS manually
[10:08:37] vgutierrez: bug on their side, that doesn't honor the configuration :(
[10:08:58] volans|off: honestly I prefer ugly to TZ mayhem
[10:09:10] yes, but JS is broken
[10:09:16] you get the 'processing' overlay
[10:09:22] well.. that's by design
[10:09:25] O:)
[10:10:24] all this js+xhr mess, it should've been a java applet! :)
[10:10:35] * bblack rewinds to 1999
[10:10:45] (╯°□°)╯︵ ┻━┻
[10:10:47] activex!
[10:11:47] in general the edge sites don't see much recdns traffic, that's normal
[10:12:01] our caches don't spam dns requests and not much else runs there.
[10:12:14] whereas in the core DCs, we have a ton of stuff that spams a lot of reqs
[10:12:30] they did see quite a lot when the various varnish python statsd daemons were resolving statsd.eqiad.wmnet for every single UDP packet they sent :)
[10:12:54] ouch
[10:13:40] well.. dns is working, loopback has the proper IPs, icinga is all green...
[10:13:46] let's repool it
[10:13:49] ema: someday we need to loop back to your longstanding half-done glibc resolver hacks stuff :)
[10:14:22] although I think if we're aiming towards anycast on the recdns side, we don't necessarily need every bell and whistle there anymore
[10:14:33] just fast spammy retries of the anycast IP basically.
[10:14:48] BTW, maybe it would be good to check in icinga that lo is configured as expected for recdns instances?
[10:15:14] bblack: yes, the main part is basically working. Now we just need to port it to rust+tokio and we can ship it! :)
[10:16:23] vgutierrez: https://github.com/voxpupuli/puppetboard/issues/461
[10:17:05] not sure if we should keep it UTC ugly and broken for now or revert it back to client time
[10:17:22] I have an itch lately to write something etcd-like, but with a smaller/simpler focus and natively better-supporting the wide-area case.
[10:17:59] like, everything is constrainted to the case of a relatively-small set of keys and very simple ops. the kind of use-case we have for pooled=yes sort of stuff only, not general/complex data.
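On the "fast spammy retries of the anycast IP" point above, the client side could collapse to something like the following resolv.conf; the address is a placeholder and the timeout/attempts values would need tuning:
```
# /etc/resolv.conf -- sketch only
nameserver 10.3.0.1   # placeholder anycast recdns service IP
options timeout:1 attempts:3
```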
[10:18:35] and given it's designed for a small dataset, errs on the side of using spammy protocols to be more reliable and sync faster in the face of partitions, etc (and handle the wide-area case better in general)
[10:20:05] 10Traffic, 10Operations: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129314 (10Vgutierrez)
[10:20:07] 10Traffic, 10Operations: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#3964044 (10Vgutierrez)
[10:21:47] maybe even handle byzantine failure, too. it all gets easier when you put heavy constraints on data types and sizes.
[10:22:47] google's published overviews of their spanner/truetime designs contain a lot of interesting insights on the subject too. not so much the low-level details, but more the stuff about how to really think of the CAP tradeoffs, and how to really think about how partitioning works in reality, etc.
[10:22:50] vgutierrez: regarging nescio reimage I can see that the first supposedly quick puppet run that should generate the certificate request timed out
[10:23:43] (the stuff in this paper is what I'm thinking of, on that subject: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf )
[10:24:14] w/ or w/o atomic clocks? :-P
[10:24:57] for my hypothetical minimal etcd thing, no :)
[10:25:41] but I think some of the thinking from that paper about their big general global sql-like thing, can be ported back to thinking about super-simple KV stores as well, more in the etcd-or-less sort of realm.
[10:26:10] there's some pragmatic things to think about there with how paritioning works in the real world and how much you care that one isolated site is unavailable, etc.
[10:26:35] (and being able to dynamically size-down a cluster so that it can continue to tolerate increasing failures and still get leader election right)
[10:27:15] yeah
[10:27:56] also latency matters less for the case I'm looking at. propogating a "serverX-is-depooled" flag globally can be a bit latent and that's ok. if you have a global network and it takes even 3-5 seconds to ack that, it's kind of ok.
[10:28:35] at that kind of level, you can probably do something like their TrueTime just using NTP and broader windows of uncertainty to cover its lower level of precision.
[10:31:31] (and you can optimize heavily for the collision/race-free case, since it's probably not ever normal in this scenario to have conflicting state changes for a key coming from different places exactly-simultaneously)
[10:43:45] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129338 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` maerlant.wikimedia.org ``` The log can be found in `/var/log/wmf-aut...
[10:52:35] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Graphite (or Prometheus) - https://phabricator.wikimedia.org/T131894#4129365 (10Gilles) a:03Gilles
[10:52:57] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Collect Backend-Timing in Graphite (or Prometheus) - https://phabricator.wikimedia.org/T131894#2182123 (10Gilles)
[10:56:10] re T131894 we don't aim to get rid of that as well in favour of a mtail + prometheus version?
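A minimal sketch of the "TrueTime with NTP and broader windows of uncertainty" idea from the discussion above: assign each write a timestamp at the pessimistic end of the local clock's error bound, then wait out the uncertainty window before exposing it, so timestamp order matches real-time order. The error bound and the in-memory store here are illustrative assumptions, not an existing implementation.
```
import time

MAX_CLOCK_ERROR_S = 0.5  # assumed worst-case NTP offset bound (placeholder)


def now_interval():
    """Return (earliest, latest) bounds on the current true time."""
    t = time.time()
    return t - MAX_CLOCK_ERROR_S, t + MAX_CLOCK_ERROR_S


def commit(kv, key, value):
    """TrueTime-style commit wait: pick a timestamp known to be >= true time,
    then block until it is certainly in the past before making the write visible."""
    _, latest = now_interval()
    commit_ts = latest
    while now_interval()[0] <= commit_ts:
        time.sleep(0.01)
    kv[key] = (commit_ts, value)
    return commit_ts
```
With a ~0.5s assumed error bound this waits on the order of a second per write, which fits the "a few seconds to propagate a depool flag is fine" constraint mentioned above.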
[10:56:10] T131894: Collect Backend-Timing in Graphite (or Prometheus) - https://phabricator.wikimedia.org/T131894
[11:14:35] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129429 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maerlant.wikimedia.org'] ``` Of which those **FAILED**: ``` ['maerlant.wikimedia.org'] ```
[12:29:52] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129515 (10Vgutierrez)
[14:28:58] 10netops, 10DBA, 10Operations: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129750 (10jcrespo) Adding the tag to reflect work done at network layer.
[15:08:05] 10netops, 10DBA, 10Operations, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129814 (10ayounsi) ```name=db1114 ethtool eno1 Supported pause frame use: No Advertised pause frame use: Symmetric Link partner advertised pause frame use: No ``` ```name=db1114's switch...
[15:09:22] 10netops, 10DBA, 10Operations, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129816 (10Marostegui) @ayounsi thanks for your help. If you want to compare it with the other two servers that receive exactly the same traffic, those are: db1066 and db1080.
[15:23:02] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129858 (10Nuria) Could we restore proxies now that the nice opera folks gave us their list? Clearly we also need to look into why/how t...
[15:26:04] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129878 (10BBlack) Yeah, ema and I discussed this after the meeting the other day. I'm not sure whether or how we can look into the hi...
[15:27:51] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129884 (10Nuria) >we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-manage...
[16:26:42] 10netops, 10DBA, 10Operations, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130018 (10ayounsi) 1/ Flow-control not helping, reverted 2/ Are the other servers seeing the same bursts of inbound sessions? 3/ The `ifconfig` input drop counter matches the nic stats...
[16:27:07] 10netops, 10DBA, 10Operations, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130019 (10Marostegui) So given that db1066 and db1080 have the same traffic than db1114 (and even more when db1114 gets depooled from API) and they don't suffer any kind of issues, could...
[16:34:14] 10netops, 10DBA, 10Operations, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130039 (10Marostegui) >>! In T191996#4130018, @ayounsi wrote: > 1/ Flow-control not helping, reverted > Cool > 2/ Are the other servers seeing the same bursts of inbound sessions? Th...
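For the db1114 flow-control thread above, the checks mentioned in the task boil down to a few ethtool/ip invocations; a sketch, with "eno1" taken from the task and counter names varying by NIC driver:
```
ethtool -a eno1                             # current pause-frame (flow control) settings
ethtool -A eno1 rx on tx on                 # enable rx/tx flow control (what was tried, then reverted)
ethtool -S eno1 | grep -iE 'drop|discard'   # NIC-level drop/discard counters
ip -s link show eno1                        # kernel-side RX drops (the "ifconfig input drop" counter)
```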
[16:43:04] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130046 (10Marostegui)
[18:14:25] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4130337 (10ayounsi)
[18:14:27] 10Traffic, 10netops, 10Operations, 10Pybal: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4130334 (10ayounsi) 05Open>03Resolved a:03ayounsi Renamed. Feel free to re-open that tasks for the future hosts.
[18:26:09] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130381 (10atgo) @Nuria @BBlack thanks for all the work on this! Once it's resolved, will the data for the time window that this was an i...
[18:34:11] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130422 (10Nuria) @atgo, It cannot be, we no longer have the original Ips of the records that are wrongly labeled.
[18:36:22] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130431 (10atgo) Ok, thanks @nuria