[00:22:18] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991510 (10ayounsi)
[00:28:36] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#3991533 (10ayounsi)
[03:24:10] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3991730 (10Volker_E) @Dzahn So it would be a request similar to @bmansurov's on RLP about cloning the corresponding GitHub repo? @...
[08:59:15] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3992080 (10ema)
[08:59:18] 10Traffic, 10Operations, 10Pybal, 10monitoring, 10Patch-For-Review: Icinga check for pybal HTTP connections to etcd - https://phabricator.wikimedia.org/T170847#3992078 (10ema) 05Open>03Resolved a:03ema
[09:18:18] root@lvs1006:~# journalctl -u pybal --since=today | grep bgp | tail -n 1
[09:18:21] Feb 22 09:13:15 lvs1006 pybal[30239]: [bgp] INFO: State is now: OPENSENT
[09:18:37] mark: that's not ok, right? We should go past OPENSENT and eventually reach ESTABLISHED
[09:19:28] I've noticed this while upgrading pybal on the LVSs, lvs1006 is the only one stopping at OPENSENT
[09:20:18] not upgrading lvs1003 (last pybal left to be upgraded) for now
[09:58:32] <_joe_> uh
[09:58:36] <_joe_> that's indeed bad
[10:47:44] <_joe_> ema: let's upgrade confctl on the cp* machines today?
[10:51:28] _joe_: OK, let's start with pinkunicorn, then eqsin as a testbed
[10:51:37] <_joe_> ok
[10:51:55] <_joe_> !log upgrading python-conftool on cp1008
[10:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:34] <_joe_> done
[10:52:40] <_joe_> what should we test?
[10:53:30] not much on cp1008 I guess, in eqsin we should test a depool/pool cycle and see that pybal does the right thing
[10:55:34] _joe_: upgrading conftool on cp5007
[10:55:54] !log upgrading python-conftool on cp5007
[10:55:55] <_joe_> ema: we should test a reboot too, maybe?
[10:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:10] <_joe_> wow I never logged into an eqsin server
[10:57:01] :)
[10:57:19] <_joe_> done! :)
[10:59:02] _joe_: depool/pool looks good, double-checked on cp5007/lvs5001
[10:59:54] <_joe_> cool
[11:00:24] _joe_: I don't think we need to test a reboot, traffic-pool.service uses /usr/local/bin/{depool,pool} which I've just confirmed are working as expected
[11:00:36] <_joe_> uhm, ok
[11:00:58] <_joe_> so what's next? cumin 'cp*' 'apt-get -y install python-conftool' ?
[11:02:16] it would be nice to ensure we avoid races with the cron-scheduled varnish-backend-restarts
[11:02:28] I'm not sure how to do that properly though
[11:03:24] we'd need /usr/local/sbin/run-no-cron :)
[11:03:28] <_joe_> ahahah
[11:04:00] <_joe_> ema: well the running script wouldn't have problems
[11:04:17] <_joe_> you would need to catch something starting exactly while the files are being replaced
[11:04:25] right
[11:04:26] <_joe_> let me calculate the probability of that
[11:04:41] :)
[11:04:42] <_joe_> how often are backends restarted?
[11:05:34] <_joe_> it's 102 hosts, let's say the window for the race condition is 1 second
[11:05:49] each backend gets restarted once per week
[11:06:39] <_joe_> so the probability is 102/(7*86400)
[11:07:22] <_joe_> less than 0.02%
[11:07:47] and you'd still see that in production, MurphyTM
[11:08:05] <_joe_> vgutierrez: I was about to say "that's not a risk we want to have"
[11:08:12] :)
[11:08:25] <_joe_> so I'll start the script not at 00 seconds on the minute
[11:08:42] <_joe_> which is when cron runs, IIRC
[11:10:36] <_joe_> test `date +%S` != 00 && ...
[11:10:39] <_joe_> SAFE!
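For reference, the back-of-the-envelope probability from the exchange above can be reproduced directly. It assumes, as stated in the chat, 102 hosts, one restart per host per week, and a 1-second race window:

```python
# Chance that some host's weekly varnish-backend restart begins inside
# the 1-second window while conftool's files are being replaced.
hosts = 102          # cache hosts, each restarted once per week
window_s = 1         # assumed length of the race window, in seconds
week_s = 7 * 86400   # seconds in a week

p = hosts * window_s / week_s
print("%.6f (%.4f%%)" % (p, p * 100))  # ~0.000169, i.e. under 0.02%
```

This matches the 102/(7*86400) figure quoted in the log.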
[11:10:43] gh
[11:12:03] <_joe_> ema: should I do it? Upgrade everywhere?
[11:18:05] _joe_: yes, there's no scheduled restart at 11
[11:18:08] sudo cumin 'A:cp' "awk '\$1 == 11 { print \$1, \$2 }' /etc/cron.d/varnish-backend-restart"
[11:18:18] <_joe_> ahahahahahaha
[11:18:20] :)
[11:19:34] ema: $1 are not the minutes?
[11:19:35] <_joe_> !log upgrading python-conftool on all cache hosts
[11:19:45] <_joe_> yes
[11:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:52] <_joe_> don't tell him!
[11:20:23] shit! :P
[11:21:25] and indeed at 11:20 we did have a restart scheduled
[11:21:28] pinkunicorn :)
[11:23:19] _joe_: what's changed in the new conftool version btw?
[11:23:53] <_joe_> ema: as far as your use is concerned, not much
[11:24:03] <_joe_> we dropped confctl --find
[11:26:54] uh, gotta run for lunch
[11:26:57] see you later
[13:17:53] back to the bgp issue on lvs1006, I've checked on cr2-eqiad and the bgp session seemed to be established properly
[13:18:03] ema@re0.cr2-eqiad> show bgp summary | match 208.80.154.139
[13:18:09] 208.80.154.139 64600 3 3 0 82 14 Establ
[13:18:25] so pybal is failing to report it properly?
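The gag above trades on crontab field order: in an /etc/cron.d file the first field is the minute and the second is the hour, so `$1 == 11` matches jobs firing at minute 11 of every hour, not jobs at 11:00 (hence the restart that really was scheduled at 11:20). A minimal illustration with a made-up cron line, not the real varnish-backend-restart entry:

```python
# Hypothetical /etc/cron.d-style entry; fields are:
# minute hour day-of-month month day-of-week user command
entry = "20 11 * * 3 root /usr/local/sbin/varnish-backend-restart"

fields = entry.split(None, 6)
minute, hour = fields[0], fields[1]  # awk's $1 and $2
print("minute=%s hour=%s" % (minute, hour))  # → minute=20 hour=11
```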
[13:18:38] tried stopping pybal on lvs1006, the session on cr2-eqiad went into Active:
[13:18:45] 208.80.154.139 64600 1435 1567 0 82 24 Active
[13:19:16] and then I started pybal again, this time we got to ESTABLISHED on the pybal side too
[13:19:23] Feb 22 13:17:03 lvs1006 pybal[11183]: [bgp] INFO: State is now: ESTABLISHED
[13:20:33] so yeah, to the best of my understanding the bgp session was established correctly but pybal failed to log the transitions to OPENCONFIRM and ESTABLISHED
[13:21:06] upgrading lvs1003 now
[13:23:16] https://github.com/wikimedia/PyBal/blob/master/pybal/bgp/bgp.py#L1043-L1046
[13:23:23] upgrade finished, all good
[13:24:09] if pybal for some reason missed the OPENCONFIRM that would explain why it didn't switch to ESTABLISHED
[13:34:24] it's more likely pybal doesn't think it was established
[13:34:43] (as valentin says, really)
[13:34:55] so pybal and the router may not agree
[13:35:19] I was checking pybal bgp.py.. it needs some serious love :)
[13:36:46] lvs1003 has the following icinga error now after pybal restart:
[13:36:48] CRITICAL: 38 connections established with conf1001.eqiad.wmnet:2379 (min=41)
[13:37:02] which confirms that the icinga check works fine
[13:37:41] diffing the etcd connection logs with pybal.conf's config statements: https://phabricator.wikimedia.org/P6730
[13:39:25] so it never established those connections after the restart?
[13:39:48] that's what it looks like, yes
[13:40:19] could be stuck here: https://github.com/wikimedia/PyBal/blob/master/pybal/etcd.py#L123 due to timeout = 0 (https://github.com/wikimedia/PyBal/blob/master/pybal/etcd.py#L108)?
[13:41:06] !log bounce pybal on lvs1003 to try establish missing etcd connections (zotero, thumbor, wdqs) https://phabricator.wikimedia.org/P6730
[13:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:04] heh, restarting pybal fixed that too
[13:42:21] well..
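For context on why one peer can sit in OPENSENT while the router already shows Establ: the standard BGP session FSM (RFC 4271) only leaves OPENSENT once the peer's OPEN message has been received and processed. The walk-through below is a toy sketch of the standard FSM's happy path, not of pybal's actual bgp.py implementation:

```python
# Happy-path transitions of the BGP session FSM per RFC 4271 (simplified).
TRANSITIONS = {
    ("IDLE", "start"): "CONNECT",
    ("CONNECT", "tcp_up"): "OPENSENT",            # our OPEN has been sent
    ("OPENSENT", "open_received"): "OPENCONFIRM", # peer's OPEN arrived
    ("OPENCONFIRM", "keepalive_received"): "ESTABLISHED",
}

state = "IDLE"
for event in ("start", "tcp_up", "open_received", "keepalive_received"):
    state = TRANSITIONS[(state, event)]
    print("State is now: %s" % state)

# A speaker stuck in OPENSENT never saw (or never handled) the peer's OPEN,
# even though the peer on the other side may consider the session established.
```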
twisted connectSSL has a default 30 secs timeout
[13:42:39] too high IMHO, but still better than 0
[13:42:53] https://github.com/twisted/twisted/blob/twisted-17.9.0/src/twisted/internet/interfaces.py#L780
[13:44:42] we've got two new issues (at least they're new to me): pybal disagreeing with the router on the bgp state, with the router being in state established and pybal opensent
[13:44:47] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3992675 (10Dzahn) @Volker_E Yes, that's right. Similar to bmansurov's. You would ask for it to clone from Github while i can te...
[13:45:16] and pybal not establishing all required etcd connections at startup
[13:45:38] I'll file tasks later on
[13:50:17] bblack: does gdnsd and/or our CI validate the 'target' of an SRV entry?
[13:50:43] I'm wondering if by mistake we could have an SRV entry that points to a non-existent hostname
[13:57:21] volans: it only checks whether the target hostname exists/has-IPs if it's in the same zone as the SRV. And then there are flags/settings about whether that's a warning or a fatal, so I'd have to double-check CI.
[13:58:35] yeah we do set the flag to make that fatal for CI
[13:59:13] bblack: great! thanks for the check
[13:59:45] I only see one case where we use SRV with cross-zone targets (which CI can't check)
[13:59:48] templates/wikimedia.org:_x-puppet-ca._tcp 5M IN SRV 0 1 8140 puppetmaster1001.eqiad.wmnet.
[13:59:51] templates/wikimedia.org:_x-puppet._tcp 5M IN SRV 0 1 8140 puppetmaster1001.eqiad.wmnet.
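The two records quoted above use the usual SRV rdata layout: priority, weight, port, target (RFC 2782). The target here lives in a different zone (wmnet) from the record (wikimedia.org), which is exactly the cross-zone case CI can't validate. A quick parse of the quoted rdata:

```python
# SRV rdata from the _x-puppet-ca._tcp record quoted above.
rdata = "0 1 8140 puppetmaster1001.eqiad.wmnet."

priority, weight, port, target = rdata.split()
print(priority, weight, port, target)

# Cross-zone check: the record sits in wikimedia.org but targets wmnet.
cross_zone = not target.rstrip(".").endswith("wikimedia.org")
print("cross-zone target:", cross_zone)  # → True
```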
[14:00:46] I was mostly interested in the etcd one, as one of the things I listed for the mediawiki use of etcd was to check how it behaves
[14:00:55] gdnsd used to do those kinds of checks across all local zones, long ago
[14:00:56] if an SRV has a target that returns nxdomain
[14:01:38] and then I added what I thought was the awesome feature at the time, to allow all zonefiles to reload asynchronously from each other and all that, which made it impractical to do cross-zone data checks.
[14:02:37] for 3.x I'm removing that set of features and going back to the model where the whole set of zonefiles is one thing that loads transactionally, and one of the positive fallouts is we can go back to doing cross-zone data checks.
[14:02:59] oh, nice
[14:03:22] (a lot of the preliminary 3.x work is about gutting complex code I wrote to support some feature that I later decided was a bad idea)
[14:03:42] in loss-of-compatibility terms it will be a quite Major version bump in that sense heh
[14:03:56] eheh
[14:05:14] and for my second doubt of the day, regarding the discovery DNS stuff, what's our mid-term plan? Should any newly added service be configured with discovery dns? Also internal non-user-facing ones?
[14:06:17] yeah I've avoided that topic. I'm not convinced yet that discovery-dns is a great long-term target in the first place. I view it more as a hack to get us through a transitional period that may last a while.
[14:07:36] it seems like in the eventual/ideal state, we'd rather have something more like static geodns at best (e.g. appservers.svc.wmnet resolves to appservers.codfw.svc.wmnet if the requesting cache/client is in codfw, and the opposite).
[14:07:55] but not the switching part?
[14:08:25] hello people, as an FYI we'd like to migrate varnishkafka misc webrequest traffic to Kafka Jumbo later on
[14:08:37] anything against it?
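The "static geodns" idea sketched above can be made concrete with a toy mapping. The hostnames come from the example in the chat; the selection logic is purely illustrative, not gdnsd's actual configuration:

```python
# Toy sketch of per-DC resolution for appservers.svc.wmnet as described:
# a client in codfw gets the codfw backend, everyone else gets eqiad.
def resolve_appservers(client_dc):
    # Illustrative only; a real policy would live in gdnsd's geodns config.
    dc = "codfw" if client_dc == "codfw" else "eqiad"
    return "appservers.%s.svc.wmnet" % dc

print(resolve_appservers("codfw"))  # → appservers.codfw.svc.wmnet
print(resolve_appservers("ulsfo"))  # → appservers.eqiad.svc.wmnet
```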
[14:08:41] I see, makes sense (re: bblack)
[14:08:43] or rather, that could also be configuration-driven rather than dns-driven, too
[14:09:01] elukey: you already have it sending both streams, right?
[14:09:21] bblack: yep, checked rtts and errors, it has been good so far
[14:09:21] elukey: [OT] jumbo is missing here https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions ;)
[14:09:46] elukey: so what actually happens in today's migration?
[14:10:31] bblack: we have been using a testing vk instance for misc -> jumbo up to now, we'd like to remove it and point the webrequest one to jumbo (currently still pushing to analytics)
[14:10:46] only for misc of course
[14:10:59] it is the first step of the migration
[14:11:17] eventually all webrequest/eventlogging/statsv vk traffic should go to Jumbo
[14:11:47] ah I guess I misunderstood. when you "remove it and point the webrequest one to jumbo", isn't this effectively the same in practice as just killing the old one and leaving the new one running?
[14:12:00] or I guess they were using some test topic?
[14:12:08] (but same input data)
[14:12:48] yep, test topic on Jumbo for the testing vk instance
[14:13:51] volans: re: mid-term, I dunno. I think it's more a standards/policies-driven question about new services, and I don't know that we have such a standard/policy, e.g.
"all new production service deployments must be multi-dc" (and in the way we mean it, where it's active/active, either side can handle the full load, and maintenance plans do not require completely shutting off either side)
[14:14:51] if we knew that was the requirement, and disc-dns is the only way to configure such a thing, then I guess the answer to your earlier question would be obvious
[14:15:04] elukey: +1
[14:15:34] bblack: yeah, in my case it will be a small internal non-public-facing one, so I think I'll stick with the standard way
[14:15:47] we don't do disc-dns with public-facing services in general
[14:15:49] but I was curious about the general case and whether we did set some policy or not
[14:15:57] well I guess it depends on definitions
[14:16:23] but the whole point of disc-dns is for our internal traffic. the outside world never sees disc-dns directly.
[14:17:17] and whether or not we have a hard policy in place yet, obviously the overall goal here is to make everything multi-dc. so everything is eventually a target for conversion to such a scheme.
[14:18:06] your new "small internal non-public-facing" service, does it matter that it's not multi-dc? is someone/something going to care when we switch to codfw and it doesn't work?
[14:18:16] sure, I meant the other public definition :D (accessible by anyone)
[14:18:42] it's a web ui for puppetdb, as part of the puppet goal, I will install it anyway in both DCs
[14:19:01] puppetdb is already multi-dc isn't it?
[14:19:09] (in some sense?)
[14:19:10] probably in an active/passive way (just because the underlying postgres is already active/passive)
[14:19:13] yep
[14:19:33] so clearly in a real-failover scenario, switching which is the active postgres must be one of our steps, right?
[14:20:03] and I'd guess the web UI is more-or-less stateless aside from connecting to puppetdb?
[14:20:28] in which case you could deploy it active/active at both sides, and just have both of them rely on the switch to say which puppetdb to connect to.
[14:20:37] indeed, I just need to check how it will handle the users, but it should be stateless or state-that-can-be-reconstructed-on-the-fly
[14:20:40] (assuming the PG connection is secure...)
[14:21:04] it will connect to puppetdb (the nice java API)
[14:21:08] so no direct connection to postgres
[14:21:17] ah
[14:21:32] well, so that should failover automagically when we switch active PGs, right?
[14:22:05] I'm just thinking, if there's a way you can rely on the existing abstraction/switching of active postgreses, there's no reason not to deploy this as an active/active web UI
[14:22:12] yeah I was thinking to just connect to the local puppetdb api and be done with it
[14:22:55] and you don't need disc-dns for such a case, unless you envision other services consuming this as an HTTP API or something?
[14:23:35] (can just make it an active/active user-facing web service via cache_misc and a public puppetdbui.wikimedia.org hostname or whatever)
[14:24:53] yeah, that's what I thought, I'm still unsure on the active/active, need to check if the puppetdb API is or can be active/active
[15:26:17] !log finished upgrading cache_text@ulsfo to varnish 5
[15:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:33] ema: did https://etherpad.wikimedia.org/p/vk-jumbo-cleanup on cp1008, if it is good it can be extended also to all cache misc
[15:33:45] ema: I was just mentally rewinding to another experiment, all the numa stuff
[15:34:14] I think last we touched on that, basically the isolate stuff in its current form on current hardware seems like a Bad Idea, it causes bad effects that don't seem easy to fix.
[15:34:29] but we left "numa_networking: on" for cp4021 as an experiment.
[15:35:06] (which is the lesser variant.
it pins nginx to node0 where the network adapter is, and also changes the IRQ routing stuff to only use node0 cpus)
[15:35:33] (but doesn't pin memory, and doesn't exclude anything else from node0)
[15:36:41] looking at e.g. the past week of upload@ulsfo (cp4021-26) in varnish-machine-stats, I think it's observable that cp4021 seems to have healthier patterns than the rest.
[15:37:01] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3992883 (10Volker_E) @Dzahn Just to be explicit about our latest structure. We plan to have an index page in design.wikimedia.org...
[15:37:24] the most-notable place you see graph differences is in the cached-memory stability:
[15:37:28] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4021&var-datasource=ulsfo%20prometheus%2Fops&from=now-7d&to=now&panelId=4&fullscreen
[15:37:38] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4021&var-datasource=ulsfo%20prometheus%2Fops&from=now-7d&to=now&panelId=4&fullscreen
[15:38:02] ^ the spikiness there in 4023 (and others) isn't present on 4021
[15:38:17] but there are similar more-subtle effects elsewhere in various stats, too
[15:38:54] and AFAIK we haven't seen any explicable cp4021-specific problems, in the ~4-5 months it's been running that way
[15:39:02] s/explicable/inexplicable/ :)
[15:39:43] so I think this all argues that we should push "numa_networking: on" further to more dcs and clusters.
[15:40:39] but I don't know it will be a smooth easy puppet rollout, either.
it changes nginx in a way that requires a true restart there (not upgrade), and it changes interface-rps config requiring a re-run of that (which I'm not sure is automatic, and might be disruptive if it is)
[15:41:20] I'll test later today at some point on cp5 and see how easy it is to manage with depools or something
[15:41:51] (oh apparently both links above are cp4021, this was the cp4023 comparison):
[15:41:54] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4023&var-datasource=ulsfo%20prometheus%2Fops&from=now-7d&to=now&panelId=4&fullscreen
[15:43:52] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10bmansurov) Unsubscribing, just to keep the noise down. Ping me if you need anything from me.
[16:53:09] ema: I peeked at vmod_netmapper for the first time in a while. it's actually kinda scary, I wouldn't be surprised at init-vs-fini bugs :)
[16:53:19] right :)
[16:53:49] especially considering it's spawning a thread varnish doesn't know about and the cleanup hack, etc
[16:54:32] here's the panic I've triggered with vcl.discard: https://phabricator.wikimedia.org/P6731
[16:56:44] yeah...
[16:57:00] pile it on the TODO list :)
[16:57:12] cp5002's frontend has 27 cold VCLs, it should be easy to get a repro there
[16:57:44] ok
[17:09:23] so, discarding one cold vcl didn't crash anything
[17:09:31] on cp5002
[17:10:56] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3317112 (10Deskana)
[17:13:38] varnishadm -n frontend vcl.list | awk '$2=="auto/cold" { print $4 }' | while read h ; do varnishadm -n frontend vcl.discard $h; done
[17:13:54] this did :)
[17:15:10] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3993367 (10Jdforrester-WMF)
[17:16:41] restarting varnish-fe to avoid icinga spam
[17:17:53] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3993403 (10Jdforrester-WMF)
[17:20:48] gotta go, cya
[18:31:04] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3993807 (10Dzahn) >>! In T185282#3992883, @Volker_E wrote: > We plan to have an index page in design.wikimedia.org and the style g...
[18:33:31] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3993819 (10Volker_E) @Dzahn Ok, good to know. Basically, it needs to be two repos, while the one (style guide) is specifically tar...
[19:58:19] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3994134 (10Dzahn) @Volker_E just to be clear, 2 different repos that are both on Gerrit (from puppet's point of view). Whether you...