[11:15:35] so... I got a nice issue with our lovely ATS TLS termination instance
[11:16:00] the bastard segfaults trying to allocate memory after the process reaches 2.7Gb
[11:16:28] using the same systemd unit and a stupid memory allocator I can allocate 10Gb successfully
[11:16:57] that program is compiled with the same compiler (clang++6.0) and the same compiler flags: -Dlinux -D_LARGEFILE64_SOURCE=1 -D_COMPILE64BIT_SOURCE=1 -D_REENTRANT -D__STDC_LIMIT_MACROS=1 -D__STDC_FORMAT_MACROS=1 -I/build/trafficserver-8.0.5/include -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -DOPENSSL_NO_SSL_INTERN -std=c++17 -g -pipe -Wall -Wno-deprecated-declarations -Qunused-arguments -Wextra -Wno-ignored-qualifiers
[11:16:57] -Wno-unused-parameter -fno-strict-aliasing -Wno-invalid-offsetof -mcx16 -g -O2 -fdebug-prefix-map=/build/trafficserver-8.0.5=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -stdlib=libc++
[11:18:11] but the log and the stacktrace show how ATS fails to allocate 128kb on the last execution
[11:18:15] traffic_server[35247]: Fatal: couldn't allocate 128408 bytes
[11:19:22] right now the systemd unit is pretty simple
[11:19:30] https://www.irccloud.com/pastebin/3DedlCwI/
[11:21:45] so at this point I'm open to crazy ideas
[11:21:47] I can have a look later if no one else does - I am not a good person to help, but I battled with large memory allocations for our mysql 512GB hosts
[11:21:58] and also with systemd
[11:22:11] give me a few hours, I am in the middle of something else
[11:22:42] jynus: no rush.. It's a bank holiday for me so I don't have a strong SLA today
[11:23:19] and of course your input is very welcome :)
[13:47:16] chaomodus: what can you tell me about af-nb-db-2.automation-framework.eqiad.wmflabs? It's one of the VMs that my new cumin setup can't contact (and I can't log into it either)
[13:48:37] I have the same issue with jmm-buster2.puppet.eqiad.wmflabs and keith-emostash2.logging.eqiad.wmflabs although I can get shell on those (cc moritzm and herron)
[13:49:21] andrewbogott: I can remove that vm
[13:49:28] oh, that's easy :) thanks!
[13:49:54] probably the same for jmm-buster2, let me check
[13:51:25] can go away, I'll remove it in a bit
[13:51:38] cool, thanks moritzm
[13:54:07] removed
[13:56:59] bblack: ah ha, found: https://wikitech.wikimedia.org/wiki/DNS/Discovery#Remove_a_service_from_production
[13:57:11] https://gerrit.wikimedia.org/r/c/operations/dns/+/535852
[13:57:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535855
[13:57:15] look ok?
[13:57:24] moritzm: what is the status of buster reimages, I think you mention some issues before last week?
[13:57:30] *mentioned
[13:59:23] they're all good to go! that was only on Monday morning when I was rebuilding the netinst images to use the new kernel from Buster 10.1
[13:59:34] ah, thanks
[13:59:38] did two buster installs already, all fine
[13:59:43] cool!
[13:59:49] (already with the 10.1 image I meant)
[14:03:17] ottomata: I think https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535669/2/hieradata/common/discovery.yaml needs to be in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535855/ too
[14:25:18] ah ok!
[14:34:52] ottomata: are you ready for us to take a stab at merging those?
[14:35:00] (do we need to downtime things first?)
[14:37:42] (I mean the discovery parts)
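(Aside, not from the log: on the ATS allocation failure discussed at 11:15–11:19 above, a 64-bit process dying around a 2.7GB ceiling can be hitting an imposed cap rather than genuine exhaustion, so it may be worth confirming what limits the live traffic_server actually inherited before digging into the allocator. A rough sketch of checks follows; the unit name trafficserver.service, the process name, and the cgroup-v1 path are assumptions, not details taken from the linked pastebin.)
  # effective rlimits of the running process (LimitAS/LimitDATA/LimitRSS show up here)
  cat /proc/$(pgrep -o traffic_server)/limits
  # what systemd believes it applied to the unit
  systemctl show trafficserver.service -p LimitAS -p LimitDATA -p MemoryLimit
  # cgroup memory cap actually in effect (cgroup-v1 layout assumed)
  cat /sys/fs/cgroup/memory/system.slice/trafficserver.service/memory.limit_in_bytes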
[15:00:07] bblack: yes go ahead! sorry been in meetings
[15:00:11] ok
[15:00:24] i could merge if you prefer...(got another meeting now tho that i'm going to lurk in)
[15:03:59] I'll merge it up
[15:04:34] danke
[15:06:48] ottomata: are you also able to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/535865 i think it's fine but worth having your eyes as you are currently working on this
[15:09:16] oh oops thank you.
[15:09:19] my fault i just removed that
[15:09:55] thanks jbond42
[15:09:56] np, thanks
[15:13:49] ottomata: ok the first layer of the onion is gone, all the -discovery parts are deployed and went fine.
[15:15:19] great
[15:15:25] next?
[15:15:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535872
[15:15:37] after 'Downtime the LVS endpoints in icinga' ?
[15:15:58] oh i should add the bit that removes the role::lvs::realserver too (?)
[15:18:39] bblack: ^
[15:19:14] so, AFAIK there's probably not an ideal way to remove the LVS part that's foolproof
[15:19:59] when you remove the IP address defs in hieradata/common/lvs/configuration.yaml while deconfiguring LVS, it will also pull the defs the hosts use in role::lvs::realserver, so you kinda have to pull that too
[15:20:27] but if those puppetize off of the hosts before LVS has caught up with the changes, some LVS healthchecks will fail, but it's the LVS side that needs some manual work post-merge
[15:20:59] so probably: yes, include everything about the LVS parts in one commit (including the realserver refs for the IPs on the service hosts)
[15:21:24] but before merging, disable puppet on the service hosts, and we'll go deal with the LVS situation manually, and then let the service hosts catch up.
[15:21:34] (and then yeah icinga downtimes too)
[15:23:50] the time dimension is so often missing from CM systems and the work we do within them (at least at the scale that would allow for clean easy removal of things by just reversing steps for how they were added)
[15:24:09] $ensure=>absent hacks help, but aren't the general-purpose solution :/
[15:26:02] ok
[15:26:16] so next step is:
[15:26:19] downtime icinga
[15:26:25] disable puppet on service hosts
[15:26:26] merge
[15:26:35] you do some lvs puppet runs and whatever else
[15:26:39] then we run puppet on service hosts?
[15:27:39] basically yeah
[15:27:55] the "do some lvs puppet runs and whatever else" part will probably have to be me, there's some tricky bits there.
[15:27:59] aye
[15:28:04] "you do" :)
[15:28:25] you ok to go now bblack? if so i'll downtime stuff
[15:28:52] (I hope it goes without saying, but you never know with so much turnover - obviously I detest anything that requires me to make statements like that, it's just our current reality)
[15:29:09] (human dependency and undocumented manual work, etc)
[15:29:09] aye yeah
[15:29:27] ottomata: I'll be ready in ~5 mins or so
[15:29:30] oj
[15:29:31] ok
[15:30:11] ((turnover was the wrong word too, we don't actually have much turnover. just +new peeps :) ))
[15:32:05] ready whenever you are lemme know :)
[15:36:19] ottomata: ok ready
[15:38:39] k
[15:38:43] downtiming
[15:39:04] ok downtimed bblack
[15:39:15] i downtimed the svc hosts and all services on the kafka main nodes
[15:39:32] disabling puppet on kafka main nodes
[15:40:31] ok done
[15:40:42] bblack i'll let you merge ya?
[15:40:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535872
[15:41:17] ok
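(Aside, not from the log: the "lvs puppet runs and whatever else" step being done here likely amounts to a puppet run on the affected LVS hosts plus hand-removal of the now-unconfigured virtual service, since — as quoted further down — PyBal does not remove ipvsadm services once they are gone from its configuration. A sketch follows; the VIP and port are placeholders, not the real eventbus service address.)
  # on each affected LVS host
  puppet agent -t                 # pick up the de-configured service
  ipvsadm -Ln                     # list virtual services and spot the stale VIP:port
  ipvsadm -D -t 10.2.2.99:8085    # delete the stale TCP virtual service (placeholder VIP:port)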
[15:54:31] ottomata: done on the LVSes
[15:54:45] ok!
[15:54:50] so run puppet on service nodes ya?
[15:55:19] bblack: ?
[15:55:22] yeah
[15:55:42] ok running on kafka-main2001
[15:56:47] i imagine the output will show some removal of the service IP and refresh of wikimedia-lvs-realserver
[15:56:59] hm, i didn't see that.
[15:57:14] just applied catalog
[15:57:43] just in case, i'm going to manually remove the inclusion, rather than just set the conditional to false
[15:59:49] peeking at main2001
[16:00:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535882/1/modules/profile/manifests/eventbus.pp
[16:00:32] maybe there is some has_lvs hiera applying i didn't find
[16:00:35] yeah does seem to be a puppetization issue there
[16:01:11] we'll be removing profile::eventbus altogether in a bit, so messing with it for decom is fine
[16:01:23] ok
[16:01:45] ottomata: can I kill the eventbus.svc dns records too? https://gerrit.wikimedia.org/r/#/c/operations/dns/+/535883/
[16:01:50] yes
[16:03:52] hm still no change bblack
[16:04:48] maybe i need to do this part first?
[16:04:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/425982
[16:07:55] conftool shouldn't matter to this
[16:09:16] it might be somehow being interfered with from the other profiles?
[16:09:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535884
[16:09:31] ?
[16:11:27] hm
[16:11:51] shouldn't be bblack it is just kafka
[16:11:53] and kafka mirror maker
[16:11:55] which don't use lvs
[16:12:48] oh there it is I bet
[16:12:50] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/535884/1/hieradata/role/common/kafka/main.yaml
[16:12:58] is probably what's keeping it alive, indirectly
[16:13:02] hm ok
[16:13:29] also, I vaguely recall one should depool conftool stuff before removing the entries, although I'm not sure if it's a requirement
[16:13:41] seems like at worst a sane benign step :)
[16:13:51] "PyBal does not automatically remove ipvsadm services once they're gone from configuration. That can be done by hand with ipvsadm"
[16:13:51] ?
[16:14:02] that's part of what I did earlier
[16:14:06] ah ok
[16:14:23] i'll just depool on the services nodes ya?
[16:14:30] then merge my patch and run puppet?
[16:15:16] I was going to check for depool via confctl but I don't even see the objects heh
[16:15:39] ha
[16:15:48] it's not called eventbus there I guess?
[16:16:08] oh I'm looking at the wrong layer, sorry
[16:16:43] bblack i so rarely use confctl i think you'd know better than I :)
[16:16:55] ah great
[16:16:59] ok they're manually all set depooled now
[16:17:53] yeah merge the puppet patch, and re-run agent on service hosts again, hopefully that makes the lvs realserver IPs go away on them
[16:17:58] (and also cleans up hieradata)
[16:18:26] ok
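(Aside, not from the log: the manual depool at 16:16 isn't shown, but given conftool objects tagged like the eqiad/eventbus/eventbus/kafka1002.eqiad.wmnet entry that appears a bit further down, it would look roughly like the following; the exact selector is an assumption, not a transcript of what was run.)
  # depool one realserver from the eventbus service (hypothetical selector)
  sudo confctl select 'dc=eqiad,cluster=eventbus,service=eventbus,name=kafka1002.eqiad.wmnet' set/pooled=no
  # verify what conftool now thinks is pooled
  sudo confctl select 'cluster=eventbus,service=eventbus' get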
[16:19:43] will i need to manually remove the stuff installed by class lvs::realserver ?
[16:20:40] huh bblack during conftool::load during puppet-merge
[16:20:50] eqiad/eventbus/eventbus/kafka1002.eqiad.wmnet
[16:20:50] 2019-09-11 16:20:03 [WARNING] etcd.client::_check_cluster_id: etcd response did not contain a cluster ID
[16:20:51] etcd.EtcdInsufficientPermissions: The request requires user authentication : Insufficient credentials
[16:21:03] that was on puppetmaster2002.codfw.wmnet
[16:21:09] other nodes seemed ok
[16:21:14] going to run puppet-merge again..
[16:21:23] oh right nothing to merge
[16:22:29] it was trying to
[16:22:29] 2019-09-11 16:20:03 [INFO] conftool::cleanup: Removing node with tags eqiad/eventbus/eventbus/kafka1002.eqiad.wmnet
[16:24:03] bblack not sure what to do from here
[16:24:19] no change in puppet on kafka-main2001, but that might be due to puppet-merge fail
[16:29:06] _joe_: any advice for what to do about ^^^ ? conftool-merge failed on puppetmaster2002
[16:30:02] ahhh maybe i did not sudo -i properly!
[16:30:47] yar in meeting.
[16:35:25] ottomata: yeah sudo -i issue
[16:35:27] I think I can fix
[16:37:24] yeah I think that fixed conftool
[16:38:18] either way, conftool doesn't control the wikimedia-lvs-realserver thing on the end service hosts
[16:41:48] will see if puppet puts it back if I remove manually (the LVS realserver IP on kafka-main2001)
[16:44:42] ottomata: what's the total set of service hosts? kafka-main200[123] + kafka-main100[123], and none need LVS stuff anymore?
[16:44:55] (manual removal did work, need to do it elsewhere so we can just move past this)
[16:53:05] hmmm I figured out the set I think: kafka-main200[123], kafka-main1001, and kafka100[23], confusingly :)
[16:55:16] ottomata: done, I think we're all cleaned up (through what's been merged so far, which is everything discovery and/or LVS related)
[17:08:03] <_joe_> ottomata: I was off, and still kinda am, but I gather bblack helped you
[17:08:36] ok he did, no worries thank you sorry for the ping!
[17:08:51] bblack: sorry was in meeting
[17:09:03] yes, kafka100[23] are in transition
[17:09:08] being replaced by kafka-main*
[17:09:16] awesooome
[17:09:20] thank you bblack
[17:09:35] i'll continue in a bit, the rest will be disabling some monitoring, and then actually removing the service itself
[17:09:38] i think i can handle that bit
[17:11:42] ok
[21:30:52] Any kafka expert around? I was wondering if it was fine to send kafka from POPs to eqiad. I've been told it's not advised to do Kafka over long latency links, but wanted to know more
[21:31:08] XioNoX: hi am about to run
[21:31:08] but
[21:31:17] it is 'fine'. we do it.
[21:31:27] e.g. varnishkafka in esams produces to kafka-jumbo in eqiad
[21:31:33] the only reason it isn't 'fine' usually
[21:31:54] is the long latency is less reliable, and produce requests are more likely to fail
[21:32:12] but, as long as the data you are sending isn't for building production apps, it will work
[21:32:29] ok, good to know!
[21:32:36] it's netflow data
[21:33:01] ya sounds fine
[21:33:14] we do it for webrequest data so ¯\_(ツ)_/¯
[21:33:25] kind of what kafka-jumbo is for :)
[21:33:56] as netflow is not encrypted I have the choice between setting up ipsec, and some udp forwarders in each POP back to a collector in eqiad that will forward them to kafka
[21:34:13] or do the kafka export directly in the pop
[21:34:17] so much easier
[21:35:30] ya the latter is good, the kafka produce will be more reliable than the udp ipsec forwarder anyway :)
[21:35:45] haha very true
[23:32:35] i am sure this is late but i see the IPv6-mapped lines are gone from site.pp and there is now code in the standard class that does it by default. nice!
[23:34:36] mutante: that's all jbond42 !
[23:35:00] :)
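(Aside, not from the log: on the POP→eqiad producer question at 21:30–21:35, the usual way to make produce requests tolerate a long-latency, lossier path is through the client's acknowledgement, retry, timeout, and batching settings. The lines below are standard Apache Kafka Java-producer properties with illustrative values only; they are not the configuration actually used for varnishkafka or the netflow pipeline.)
  # wait for the full in-sync replica set to acknowledge each batch
  acks=all
  # retry transient failures instead of dropping data on the long link
  retries=10
  # give retries time to succeed end-to-end (Kafka clients >= 2.1)
  delivery.timeout.ms=300000
  # per-request timeout, generous for a high-RTT path
  request.timeout.ms=60000
  # fewer bytes over the WAN
  compression.type=snappy
  # batch harder, trading per-message latency for throughput
  linger.ms=500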