[05:56:55] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132722 (10Marostegui) [05:57:14] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (10Marostegui) p:05Triage>03High [05:58:11] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (10Marostegui) I have written a summary of the current state of debugging on the original task description, so it is easier to read instead of going thru all the co... [07:16:38] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132767 (10Marostegui) [07:35:06] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132775 (10Marostegui) [07:46:26] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` achernar.wikimedia.org ``` The log can be found in `/var/log/wmf-aut... [07:46:50] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132785 (10Marostegui) [07:54:57] so this is how yesterday's upload@esams troubles looked like: [07:54:59] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=13&fullscreen&orgId=1&from=1523756810836&to=1523841638338 [07:55:08] morning ema [07:56:39] hi vgutierrez! [07:58:07] cp3039's expiry thread managed to clean its backlog at 11:18ish on its own [07:59:05] but then it had troubles catching up for the rest of the day basically, till 21:45 when I restarted varnish-be [08:00:14] interesting to see how troubles move to other backends when a backend starts lagging behind (frontends mark it as sick and move load to other backends) [08:00:41] see 15:45ish [08:02:04] manual restarts timeline: 21:30 cp303[68], 21:45 cp3039, 22:09 cp3037 [08:04:31] 503s in this case were due to esams fe<->be fetch failures, rather than be<->be, which I think was the case w/ cache_text [08:05:41] here's the varnishospital plot: https://logstash.wikimedia.org/app/kibana#/visualize/edit/ea257030-2cf2-11e8-ba7d-9919d9911746?_g=h@33e1379&_a=h@52d1d8a [08:06:26] the link is broken (according to kibana), you should use the share function :( [08:06:47] oh [08:06:50] https://logstash.wikimedia.org/goto/54bb6490e1fa1c6209c301a1af8d4b04 [08:06:52] <3 [08:07:06] and now I get a big "Kibana did not load properly. Check the server output for more information." [08:07:10] grr [08:07:15] damn [08:08:38] I've tried changing the timeframe and generating another short link, still getting a kibana error [08:09:52] the long one seems to work: https://bit.ly/2EPTpR9 [08:10:29] the bit.ly one works indeed [08:10:37] the previous one wasn't showing any results here [08:10:56] it seems to be a known kibana issue https://github.com/elastic/kibana/issues/12915 [08:11:58] this is a case of "my tools broke my tools with my tools" [08:13:01] vgutierrez: can you create a shorturl from the bit.ly one and see if that works? 
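For reference, a minimal sketch of how the expiry-thread backlog ("mailbox lag") discussed above can be eyeballed on a cache node before deciding on a manual varnish-be restart. It assumes the backend varnishd is the default (unnamed) instance exposing the standard MAIN.exp_mailed / MAIN.exp_received counters (Varnish 4.x); the threshold and the depool/restart commands in the comments are illustrative, not the procedure actually used here.

```bash
#!/bin/bash
# Sketch: estimate the varnish-be expiry "mailbox lag" on a cache node.
# Assumes the backend is the default (unnamed) varnishd instance; the frontend
# would need "-n frontend" instead.

mailed=$(varnishstat -1 -f MAIN.exp_mailed   | awk '{print $2}')
received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
lag=$((mailed - received))

echo "expiry mailbox lag: ${lag} objects"

# Threshold is illustrative only; pick something informed by the failed-fetches
# dashboards linked above.
if [ "${lag}" -gt 100000 ]; then
    echo "expiry thread is not catching up; a varnish-be restart may be warranted"
    # hypothetical manual remediation, depooling first:
    #   sudo -i depool && sleep 300 && sudo systemctl restart varnish.service && sudo -i pool
fi
```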
[08:14:34] sure [08:15:09] LOL, I got an error from bit.ly [08:15:16] now: https://bit.ly/2H6xQgY [08:22:44] bblack: perhaps we should consider going back to two restarts a week https://gerrit.wikimedia.org/r/#/c/426858/ [08:24:34] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132831 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['achernar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['achernar.wikimedia.org'] ``` [08:42:39] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4132869 (10Vgutierrez) 05Open>03stalled [09:04:49] hmm so it looks like we're missing megacli for stretch in our repos [09:05:14] we've 8.07.14-1 for jessie, but nothing for stretch apparently [09:06:22] let me check, IIRC this package is used across distros (statically linked) from an external repo # [09:06:52] 1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty amd64 Packages [09:07:07] that's what apt is reporting on acamar [09:07:25] and nothing on acheranr (just reimaged as stretch) [09:07:31] *achernar [09:07:57] according to https://hwraid.le-vert.net/wiki/DebianPackages [09:08:04] the same version is available for stretch [09:08:33] same version meaning 8.07.14-1 [09:12:07] megacli is present in the repository component and the repository component is also present on the host [09:12:27] the component is different, with stretch forward we're using a more fine-grained scheme [09:12:35] it's thirdparty/hwraid now [09:12:47] since e.g. all our Ganeti hosts don't need thirdparty/hwraid [09:12:59] so we need to figure out why megacli doesn't get installed [09:13:38] vgutierrez@achernar:~$ apt-cache policy hwraid [09:13:39] N: Unable to locate package hwraid [09:15:29] the raid class checks the puppet fact 'raid', megacli only gets installed if that contains 'megacli' [09:15:42] maybe something changed in facter [09:16:31] vgutierrez: no, you need to run "apt-cache policy megacli", "thirdparty/hwraid" is the name of the repository component [09:16:38] yup [09:16:44] I've seen that [09:16:59] facter reports raid => ["md", "megaraid"] so seems correct [09:17:00] I see that the package *is* installed on achernar, so this is a matter of figuring out why puppet hasn't done that? [09:17:08] yey... right now it's installed [09:18:03] it was installed by puppet [09:18:08] here https://puppetboard.wikimedia.org/report/achernar.wikimedia.org/9eb91213a82f944ad92797de69cdaae0f2c2d6a7 [09:18:30] and I asked after seeing this: https://puppetboard.wikimedia.org/report/achernar.wikimedia.org/2c50ea17bf30f93c189707f8a591f5e43a1f4206 [09:20:36] so as moritzm said, facter must changed between those puppet runs [09:21:45] sorry about the noise :) [09:23:12] np! [09:23:26] so.. everything looking good for achernar (icinga, manual DNS checks...) repooling it [09:26:43] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132972 (10Vgutierrez) [09:58:21] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133017 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` acamar.wikimedia.org ``` The log can be found in `/var/log/wmf-auto-... 
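A quick way to reproduce the megacli check discussed above (why puppet did or did not install it on a freshly reimaged stretch host) is to compare what facter reports with what apt can see. This is a sketch: it assumes the custom 'raid' fact ships via puppet pluginsync (hence -p), and that on stretch the package now comes from the thirdparty/hwraid component rather than plain thirdparty.

```bash
# Sketch: verify why the raid class would (not) install megacli after a reimage.

# 1) What controllers does facter report? The raid class only pulls in megacli
#    when the MegaRAID entry shows up here (the log above mentions both
#    'megacli' and 'megaraid' as the expected value).
sudo facter -p raid

# 2) Is the package visible to apt at all? On stretch it lives in the
#    thirdparty/hwraid component, not plain thirdparty as on jessie.
apt-cache policy megacli
grep -ri 'hwraid' /etc/apt/sources.list /etc/apt/sources.list.d/ || \
    echo "thirdparty/hwraid component not configured on this host"
```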
[10:29:03] BTW, regarding that bblack said last week https://wm-bot.wmflabs.org/browser/index.php?start=04%2F12%2F2018&end=04%2F12%2F2018&display=%23wikimedia-traffic at 13:22:47, chromium and hydrogen are up next for being reimaged [10:36:14] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133111 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['acamar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['acamar.wikimedia.org'] ``` [10:37:03] see https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) for example commits to drop them from the LVSes temporarily [10:37:14] yup, thx :) [11:55:42] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133269 (10Vgutierrez) [12:47:34] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133388 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['hydrogen.wikimedia.org'] ``` Of which those **FAILED**: ``` ['hydrogen.wikimedia.org'] ``` [12:53:57] ema: I was about to say: but the new failed fetches / restarts problem seems to be cache_upload, we were tracking cache_text issues before, they may be different. [12:54:13] ema: but if esams alert turns out to be caches' fault, there goes that theory. [12:54:53] whereas cache_upload is the one that historically had problems with storage + mbox lag long ago, but wasn't so much participating in the more-recent variants of the problem that seemed to vanish on vcl_hit changes and related. [12:56:52] gehel: kibana seems to be having troubles w/ short links (see eg: https://logstash.wikimedia.org/goto/1934acb94b581060b6dad36f8fd0c79e which I've just created) [12:57:03] gehel: long links work fine [12:57:22] ema: :( [12:57:37] gehel: I've found this which sounds suspiciously similar (I'm also running chrome) https://github.com/elastic/kibana/issues/12915 [12:57:41] don't know anything about that yet... I'll have a look at the logs, see if I find something obvious [12:57:45] ty! [12:58:12] vgutierrez: btw re: ntp, a quick note [12:59:00] vgutierrez: I was looking at some of the recently-reimaged, and their ntp states were a little "off", so I slowly restarted the ntp services on acamar + achernar + maerlant + nescio over the past ~30-40 mins. [12:59:20] ack [12:59:25] I'll take a look, thx [12:59:26] this is the varnishospital graph for the latest text_esams spike https://bit.ly/2qyihJ2 [12:59:34] "ntpq -p" is a good status command to check, but on the other hand there's no clear answer as to what the output there should look like, it's quite complex and never looks quite right [12:59:43] they're all in a decent state now I think [12:59:45] very evenly distributed among all backends [13:00:21] but probably the important thing is: after a fresh reinstall + re-puppetization, I don't think ntpd's initial startup gets dependencies perfect (and never will I assume), re: ferm bringup, interface flaps, resolv.conf, etc, etc.... 
[13:01:05] maybe best that after reaching puppet stability (re-running agent does nothing), wait ~10 minutes and then do "service ntp restart" on the newly-re-imaged, so they can restart the ntp algorithms from a sane state [13:01:35] and put a few hours between the remaining pair at codfw so the first can truly-stabilize a bit before the other goes down [13:02:48] right [13:04:39] and in general (not immediately-important), I still don't think I'm happy with how our ntp is currently architected. it's better than it was, but there are a number of choices and tradeoffs we can make, and the current set of them probably isn't ideal (but not bad considering this strategy had to work for the jessie->stretch transition too, which is now nearly-complete!) [13:05:58] re: topologies of public + private server/peer/pool statements and how that affects our global clock stability inside the WMF, and how we tolerate various failure modes (the key ones being: wmf site isolated but can reach public internet, and wmf sites all connected but one (or all) can't reach public servers) [13:06:36] let's deploy 1 GPS receiver per DC O:) [13:06:48] heh you laugh, I did that at two jobs before :) [13:06:59] ema: I can reproduce the issue on kibana, and creating the empty css as explained in the GH issue seems to fix the problem. [13:07:15] ema: I'll open a phab task and see if we have a better fix than that... [13:07:26] bblack: same 4-5 jobs ago [13:07:43] but I really don't want to get into that here, it's a PITA at the dc-ops and dc-contracts/services level to run coax antennas to the roof from a cage with minimal wire length, etc [13:08:10] it's a little different if you own your own physical DCs! :) [13:10:31] gehel: awesome, thank you! [13:13:48] sigh... hydrogen is sick :/ [13:14:07] anyways, on ntp architecture: what we do now amounts to: each server hits the public pools as upstream servers, and non-core DCs have the other local ntp server + all 4x at the cores as peers, whereas the cores have basically all our global servers as peers. [13:14:15] https://phabricator.wikimedia.org/P6997 [13:14:39] and then also some orphan settings so that in the case of loss of public reachability, the core or non-core ntp servers can take over local clocks as appropriate, with the cores getting priority if reachable [13:15:04] vgutierrez: nice. I guess it's still running fine, just with a broken side of the md mirror? [13:15:43] it's actually still rebuilding the raid-1 [13:15:50] and it's not complaining [13:15:52] ok [13:16:16] I'm going to wait till the raid-1 is rebuilt for the mandatory reboot [13:16:20] I guess we should hurry up with those replacements! :) [13:16:40] meanwhile let's open the phabricator task for the HDD failure [13:17:21] anyways, some other options for ntp: we could stop x-dc peering. just use "peer" statements between the local pair in a DC, and use "server" statements to make the core ones upstream servers of the non-core (which they'd use if they lost their local connectivity to the outside world but not the WMF network) [13:18:13] under that config in an all-is-normal scenario, we'd be basically relying exclusively on the synchronicity of different public global ntp pool servers to keep our DCs' clocks in sync. [13:19:16] or we could try to do some hacks on stratum information in the config to prefer our core DCs' servers over the public ones at the edge DCs, but that's also kind of ugly because it forces them to get their sync over higher-latency links.
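To make the "reduced peering" option concrete, here is a rough ntp.conf fragment for a non-core site's ntp server under that scheme: public pools as the normal upstream, only the local partner as a peer, the core-DC servers as fallback upstreams, and orphan mode as a last resort. The pool names, the orphan stratum, and the choice of hosts are illustrative assumptions, not the current puppet-generated config.

```bash
# Illustrative only: what e.g. maerlant's config might look like under the
# "reduced peering" idea sketched above (NOT the current puppet output).
cat <<'EOF' > /tmp/ntp.conf.reduced-peering-example
# public pools as the normal source of truth
pool 0.debian.pool.ntp.org iburst
pool 1.debian.pool.ntp.org iburst

# only the other local ntp server as a peer
peer nescio.wikimedia.org

# core-DC servers as fallback upstreams, used if the public internet is
# unreachable but the WMF network is not
server chromium.wikimedia.org iburst
server hydrogen.wikimedia.org iburst

# orphan mode: if everything above is gone, keep serving local clients anyway;
# stratum 12 is an arbitrary illustrative value
tos orphan 12
EOF
```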
[13:19:41] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133478 (10Marostegui) After the last reboot the errors have moved from being at times like: XX:10:11 XX:20:11 XX:30:11 To: XX:04:11 XX:24:11 XX:34:11 [13:20:07] ema: https://gerrit.wikimedia.org/r/#/c/426912/ <- that should fix the issue, but it is ugly enough that I don't really want to merge it... [13:20:09] and then coming into all the discussion sideways is the desire to move to chrony, which is already at least mostly-implemented [13:20:11] 10Traffic, 10Operations, 10ops-eqiad: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) p:05Triage>03Normal [13:20:42] 10Traffic, 10Operations, 10ops-eqiad: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) SMART info about sda: ```root@hydrogen:~# smartctl -a /dev/sda smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build) Copyright (C) 2002-16, Bruce A... [13:20:55] chrony has a different/lesser set of capabilities vs the normal ntpd. I was initially thinking it was very limited re: peering local pools of servers and such, but now I tend to think it's just a bit different and under-documented, but maybe can do a lot of the same things in practice. [13:21:18] it'd be nice to have a layout that definitely works for chrony as well [13:21:28] anyways, all kind of on the lower-priority back-burner to think about! [13:22:47] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133504 (10Marostegui) [13:23:47] gehel: I think a simple 's/ugly hack/XXX: temporary workaround/' would get the job done! [13:26:37] bblack: https://gerrit.wikimedia.org/r/#/c/426896/ and https://gerrit.wikimedia.org/r/#/c/426913/, real opera mini subnets already added to the actual private puppet repo. VCL changes next (it should be as simple as using the "trusted_proxies" netmapper database instead of "proxies" [13:26:47] ) [13:28:28] chrony should be able to provide a feature-equivalent setup to ntpd, I'll update the existing patches, but haven't found the time so far [13:28:54] ema: ok awesome [13:31:00] vgutierrez: on a more-timely note about ntp: once chromiu+hydrogen are done ugprading, you can patch out some of the jessie-only crap in modules/profile/manifests/ntp.pp [13:31:35] vgutierrez: probably just get rid of $peer_upstreams and the one jessie-v-stretch conditional block, I think. [13:32:01] (but not anything deeper, as the $servers stuff in modules/ntp is still useful in other present/future scenarios) [13:32:04] ack [13:32:58] I guess I should go pretend it's a holiday! :) [13:33:11] oh is it? :) [13:33:15] have fun then! [13:33:23] apparently it's a fake US WMF holiday day heh [13:34:04] https://office.wikimedia.org/wiki/Staff_handbook/Benefits#2018_Holidays [13:34:25] ^ they inject "WMF Holiday" here and there to ensure we get some minimum holiday count that's relatively-evenly distributed through the year [13:34:42] or something [13:36:50] https://en.wikipedia.org/wiki/Emancipation_Day is maybe what they were aiming at, although it's not a nationally-consistent one [13:37:07] Washington DC observes it on this day, but e.g. TX where I'm at observes it on June 19th [13:37:49] it's also Patriots' Day apparently [13:38:46] funny.. 
dns4* is not able to peer with eqiad ntp servers [13:39:27] I wouldn't say "not able to", but.... ntp is complicated, especially with maxclock and a large and variable count of pool-servers + peers [13:39:34] it's more like they decided it's not necessary at present [13:39:47] at least currently :) [13:39:50] chromium.wikime .INIT. 16 s 34d 64 0 0.000 0.000 0.000 [13:39:53] hydrogen.wikime .INIT. 16 s 34d 64 0 0.000 0.000 0.000 [13:39:57] that's dns4001 [13:40:07] yeah [13:40:33] getting stuck-in-.INIT. like that isn't unusual though, for "peer" when there are other valid peers and servers that look better. [13:41:07] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133555 (10Marostegui) As another test to discard issues - I have rebooted db1114 with an older kernel. So it is now running ``` root@db1114:~# uname -a Linux db1114 4.9.0... [13:41:33] sometimes it eventually resolves. sometimes restarting a relevant daemon helps, but that also hurts in other ways (best do it far apart from all the other ongoing work) [13:42:14] note that chromium does see dns4* as valid peers [13:42:34] and acamar shows dns4* as .INIT [13:42:37] so, yeah :) [13:43:13] those things would probably at least *look* saner if we had a structure that had a lower count of "peer" entries heh [13:44:10] on the other hand, if you see WMF peers that are stuck in other states than .INIT., like .XFAC. or .STEP., that probably means something needs restarting (probably the local daemon reporting such states) [13:45:35] ntp is an ugly thing in general [13:46:00] dns4002 is on .STEP. for eqiad peers [13:46:07] ah [13:46:14] hmm btw, how other boxes get their clock synced? [13:46:38] you mean other boxes aside from our set of actual ntp servers? [13:46:48] indeed [13:48:01] so, for not-ntp-server machines, they include modules/standard/manifests/ntp.pp which gives them a list of our ntp servers per-DC [13:48:33] and then at least for jessie+ (but not ancient trusty hosts), modules/standard/manifests/ntp/timesyncd.pp pulls that data together to configure the systemd ntp client [13:48:48] so client machines in core sites hit all 4 servers at both core sites [13:49:05] clients at non-core sites hit their local 2x servers, plus the 2x at the closest core site [13:49:30] I need to level up my systemd game... [13:50:02] the few remaining trusty hosts don't have systemd and its ntp client junk, so they use traditional ntpd in client-only mode using a similar config from: modules/standard/manifests/ntp/client.pp [13:51:12] I kind of like the way the client configs are currently (2x local + 2x closest core, interpreting that as all 4x core servers for core-dc clients) [13:51:54] and I think it jives pretty well with a reduced-peering idea for the ntp servers' configs, where we mostly rely on the synchronicity of external public pools under normal conditions. [13:52:31] (in other words, servers only "peer" with the opposite local server, and use "server" as a fallback towards the core DCs) [13:52:53] maybe still keep full peering between the 4x servers at the core DCs, I donno, it needs thinking [13:57:03] (or we just go crazy and put GPS antennas and rubidium local clocks in all the DCs to throw $$+hardware at it, but that seems kinda crazy for our needs!) [13:57:25] anyways, off to holiday-mode for real this time [13:57:30] enjoy bblack! 
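For the client side described above (jessie+/stretch hosts using systemd-timesyncd rather than a full ntpd), the effective configuration boils down to something like the following: 2x local servers plus the 2x at the closest core site. The host names are the examples from this discussion, and the exact file layout is whatever modules/standard/manifests/ntp/timesyncd.pp templates out, so treat this as a sketch.

```bash
# Sketch: what the timesyncd client config amounts to on, say, an esams host.
cat <<'EOF' > /tmp/timesyncd.conf.example
[Time]
NTP=maerlant.wikimedia.org nescio.wikimedia.org chromium.wikimedia.org hydrogen.wikimedia.org
EOF

# Quick check that a client is actually synced:
timedatectl status | grep -i 'synchron'
# On the ntp servers themselves, "ntpq -p" remains the (messy) status command.
```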
[14:00:16] gehel: let me know when you are ready to deploy wqds-internal :) [14:01:15] *wdqs [14:02:31] vgutierrez: I'm as ready as I'll ever be :) [14:02:43] awesome [14:02:50] I need this merged: https://gerrit.wikimedia.org/r/#/c/424599/ [14:03:15] to be able to run puppet on the lvs instances and get the new config [14:03:52] vgutierrez: I think we need to merge the DNS change first [14:04:01] https://gerrit.wikimedia.org/r/#/c/424587/ [14:05:15] vgutierrez: if you are ready, I'll merge that DNS change and run authdns-update [14:05:20] sure [14:05:21] go for it [14:06:33] authdns-update fails... [14:06:41] error: Name 'wdqs-internal.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-wdqs-internal' [14:07:28] that's defined here: https://gerrit.wikimedia.org/r/#/c/424599/2/conftool-data/discovery/services.yaml [14:07:43] Oh, so actually, the puppet change needs to be merged first? [14:08:35] it looks like it, you got the newbie to assist in the deployment :) [14:08:40] vgutierrez: you probably understand the dependencies better than I do... [14:09:19] ok, unless you have an objection, let's merge the puppet side and that should get things corrected... [14:09:28] no objection at all [14:09:44] on the lvs side nothing will happen till pybal is manually restarted [14:10:20] and it is a new service... so we're not going to break it :) [14:10:31] let's just try to not break something else [14:11:44] yeah, conftool-data/discovery/services.yaml needs to be merged first IIRC [14:12:20] <_joe_> gehel: yes you need to merge the puppet part first [14:12:32] <_joe_> not just the conftool-data one though, there is also in hiera [14:13:26] _joe_: hiera = https://gerrit.wikimedia.org/r/#/c/424599/3/hieradata/common/lvs/configuration.yaml or is there something else we missed? [14:13:43] <_joe_> gehel: I was talking about the discovery dns part [14:14:06] _joe_: ok, so we probably missed something... let's hold that... [14:14:07] <_joe_> hieradata/common/discovery.yaml [14:14:36] <_joe_> you need to reference the lvs name, and declare which kind of service are you working on [14:14:48] _joe_: right, I see it! [14:15:06] let me add a commit and puppet-,merge all that [14:15:29] <_joe_> the correct order would be: 1 - conftool-data + hieradata 2 - dns [14:15:37] <_joe_> sorry [14:15:48] <_joe_> 1.5 - run puppet on all authdns hosts [14:15:56] _joe_: no problem! I'll try to document that once it is done... [14:17:11] _joe_, vgutierrez: https://gerrit.wikimedia.org/r/#/c/426926/ [14:22:25] vgutierrez: ok, authdns-update now works [14:22:29] cool [14:22:33] _joe_: thanks for the hints! [14:23:33] I already got the config on lvs2006 and lvs1006 (secondary LVSs) [14:23:58] so let's restart pybal on lvs2006 and let's see how it behaves :) [14:24:15] oh I probably gave bad advice, re: 2x commits. 
probably the only sane way to deploy a new dns-disc without running into failures or tricky hurry-up deploy timing, is to split it in 3 commits [14:24:22] re wdqs-internal [14:24:38] as in, do a dns commit with just the mock part, then the puppet commit, then the rest of the dns commit [14:24:57] but if you just get it all pushed and run puppet stuffs enough, it will all come together eventually [14:25:11] we need some more docs :) [14:25:29] (if you did just the puppet part first and then delayed on doing any dns change containing at least the mock part, then authdns-update would fail for other dns commits that are unrelated, until you finish) [14:25:37] DNS tested, looks good [14:26:54] puppet needs no be run on confd instances I guess [14:26:55] Apr 16 14:26:39 lvs2006 pybal[29831]: [config-etcd] INFO: connected to etcd://conf2001.codfw.wmnet/conftool/v1/pools/codfw/wdqs-internal/wdqs/ [14:26:58] Apr 16 14:26:39 lvs2006 pybal[29831]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): : 404 Not Found [14:27:02] Apr 16 14:26:39 lvs2006 pybal[29831]: ] [14:27:05] :) [14:27:13] puppet-merge is supposed to take care of that [14:27:21] may be some other issue here, re: naming consistency? [14:27:45] checking... [14:28:24] (maybe wdqs_internal vs wdqs-internal? random guess, probably not) [14:28:35] actually I'm not seeing wdqs-internal ehre: https://config-master.wikimedia.org/pybal/codfw/ [14:29:47] If I understand correctly, it shoudl come from https://gerrit.wikimedia.org/r/#/c/424599/3/hieradata/common/lvs/configuration.yaml, which looks correct to me... [14:30:09] but I've probably been staring at it for too long to spot any typo [14:32:49] yeah I don't see wdqs-internal in conftool/etcd at all [14:32:55] yup [14:32:59] puppet-merge of the puppet commit didn't report some error in the final part about etcd stuff? [14:33:57] I just synced it [14:34:00] data exists now [14:34:09] 2018-04-16 14:33:45 [INFO] conftool::load: Syncing static object service:wdqs-internal/wdqs [14:34:13] 2018-04-16 14:33:45 [INFO] conftool::load: Adding objects for node [14:34:15] 2018-04-16 14:33:45 [INFO] conftool::load: Creating node with tags codfw/wdqs-internal/wdqs/wdqs2004.codfw.wmnet [14:34:16] interesting, yes, puppet-merge did give me a warning [14:34:18] etc... [14:34:18] 2018-04-16 14:19:25 [WARNING] etcd.client::_check_cluster_id: etcd response did not contain a cluster ID [14:34:18] 2018-04-16 14:19:25 [ERROR] conftool::load: Loading of data for entity service failed: Backend error: The request requires user authentication : Insufficient credentials [14:34:27] sudo -i ? [14:34:30] yeah sudo -i [14:34:36] yeah... [14:34:42] in general, over time a bunch of our random $tools that require root end up requiring sudo -i [14:34:54] it's such a persistent issue I give up tracking which, I just always use sudo -i for everything [14:34:59] and don't forget to review the output of puppet-merge... [14:35:19] bblack: thanks for the help! [14:35:21] anyways, I did the conftool-merge part under sudo -i already, so can move on [14:36:16] awesome [14:36:25] now we've https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal and https://config-master.wikimedia.org/pybal/codfw/wdqs-internal [14:36:28] :) [14:37:02] vgutierrez: checking icinga, there are a few criticals about PyBal connections to etcd (lvs[12]00[36]) [14:37:05] is that expected? 
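Putting the sequence above together, a rough checklist for taking a brand-new LVS service like wdqs-internal from merged puppet data to pooled backends might look like the following. The confctl selector and hostname are examples; the key detail from the log is that the conftool sync triggered by puppet-merge needs a root login shell (sudo -i).

```bash
# Rough checklist distilled from the wdqs-internal deploy above (illustrative,
# not an official runbook). Run on the puppetmaster.

# 1) merge conftool-data + hieradata (the puppet change) first; the conftool
#    sync needs a root login shell, hence sudo -i
sudo -i puppet-merge

# 2) check that the etcd-backed pybal config now exposes the new service
curl -s https://config-master.wikimedia.org/pybal/codfw/wdqs-internal
curl -s https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal

# 3) new backends start out depooled; pool them explicitly (example selector)
sudo -i confctl select 'name=wdqs2004.codfw.wmnet,service=wdqs' set/pooled=yes

# 4) only then merge the DNS discovery change, run authdns-update, and restart
#    pybal on the secondary LVS first, primary last
```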
[14:37:16] gehel: indeed [14:37:30] on lvs2006 it should be fixed [14:37:37] the others will get fixed when pybal is restarted [14:37:46] ok, so everything else looks good in icinga... I'll do some testing of the new endpoints to see if it works as expected [14:37:54] well [14:38:12] on confd I'm seeing all servers marked as depooled [14:38:26] vgutierrez: yep, new service, they are depooled by default [14:38:34] so on lvs2006 we've the new service IP on ipvsadm but without any backend :) [14:39:52] vgutierrez: they should be pooled now [14:40:08] they are, yes [14:40:28] right [14:40:43] and the BGP route is also there [14:40:43] 10.2.1.41/32 *[BGP/170] 00:04:21, MED 100, localpref 100 [14:40:43] AS path: 64600 I, validation-state: unverified [14:40:44] > to 10.192.17.6 via ae2.2018 [14:40:58] and ipvsadm looks good on lvs2006 [14:41:02] tpy [14:41:04] *yup [14:41:23] I'm going to restart lvs1006 pybal as well [14:41:35] ok [14:43:23] and both wdqs-internal.svc.{eqiad|codfw}.wmnet answer correctly to simple queries [14:43:36] pybal on lvs1006 looking good [14:43:41] ipvsadm looking good as well [14:43:50] let me check BGP, but it's going to be OK :) [14:44:26] discovery endpoint also looks good [14:44:42] 10.2.2.41/32 *[BGP/170] 00:01:56, MED 100, localpref 100 [14:44:42] AS path: 64600 I, validation-state: unverified [14:44:42] > to 208.80.154.139 via ae2.1002 [14:44:44] right :D [14:46:23] * gehel understands mostly nothing about BGP [14:46:45] vgutierrez: why is there a reference to 208.80.154.139 ? [14:47:33] <_joe_> uhm [14:47:51] that's lvs1006 injecting the route [14:47:54] <_joe_> I'm not sure curl https://config-master.wikimedia.org/discovery/services.yaml looks right [14:48:18] hmmm looks weird [14:48:33] <_joe_> no it's actually correct, scratch that [14:48:43] ack [14:48:48] _joe_: ack [14:49:17] so.. going to restart pybal on lvs2003, traffic is going to get diverted to lvs2006 for a jiffy [14:49:40] vgutierrez: ack [14:51:43] all good.. wdqs-internal it's being announced by lvs2003 and lvs2006 for codfw [14:53:01] let's finish up, restarting lvs1003 [14:54:24] gehel: all good on pybal / BGP side :) [14:54:30] vgutierrez: kool! Thanks for all the help! [14:54:40] np [14:54:58] all looks good on my side. I'll do some more testing before sending real traffic there, but we're moving forward nicely! [14:55:13] great [14:55:22] let me know if you need anything else from our side [15:04:55] 10Traffic, 10Operations, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133869 (10Vgutierrez) [15:05:51] 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4133871 (10ema) >>! In T187014#4129884, @Nuria wrote: > +1 let me know when it is in place and i can help check things square again on my... [15:06:02] bblack was right (as usual).. it looks like ntp start-up dependencies aren't being handled properly [15:07:01] with hydrogen..
I restarted the host after puppet was clean, and even in that case, I needed to restart ntp manually 10 minutes after [16:11:32] 10netops, 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4134090 (10Marostegui) [16:14:53] 10Traffic, 10Analytics, 10Analytics-Cluster, 10Operations, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4134111 (10mforns) 05Open>03Resolved a:03mforns [16:26:10] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#4134131 (10mforns) p:05Normal>03Low [19:59:44] 10Traffic, 10Analytics, 10Operations, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#4134556 (10Ottomata) Hm, I just thought about this a little bit, and I'm not so sure we should do it. The hiera in... [20:04:00] 10Traffic, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4134566 (10Imarlier) a:03BBlack Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, i...
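Following up on the ntp start-up dependency problem vgutierrez hit on hydrogen above (ntp needing a manual restart ~10 minutes after boot even with puppet clean): if this turns out to be an ordering issue with the network/ferm coming up, one common mitigation is a systemd drop-in that delays ntp until the network is genuinely online. This is a hypothetical sketch, not something that exists in puppet today, and whether it actually addresses the ferm/resolv.conf races bblack mentioned would need testing.

```bash
# Hypothetical mitigation for the "ntp needs a manual restart after boot" issue:
# make the ntp unit wait for the network to be fully online before starting.
sudo mkdir -p /etc/systemd/system/ntp.service.d
sudo tee /etc/systemd/system/ntp.service.d/wait-for-network.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target nss-lookup.target
EOF
sudo systemctl daemon-reload
# verify on the next boot with: ntpq -p   (peers should leave .INIT. on their own)
```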