[04:44:29] Traffic, Operations, Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (Vgutierrez) So it looks like we are still leaking memory with ATS 8.0.5-1wm6: ` (gdb) bt #0 0x00002adfd1a16fff in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0...
[06:46:40] vgutierrez: hello!
[06:47:03] hi there
[06:47:14] can we merge that patch?
[06:48:16] yeah.. let's go for it
[06:48:33] I'm gonna go crazy if I keep staring at ATS memory usage dumps
[06:48:33] alright!
[06:48:54] memory dumps will definitely make you go crazy :)
[06:48:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/529053 right?
[06:49:22] yep
[06:49:23] that one
[06:51:30] onimisionipe: so.. I was running a sanity check...
[06:51:43] and I think that your PR is affected by this as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/534462/1/hieradata/common/lvs/configuration.yaml#905
[06:53:01] vgutierrez: not sure I get
[06:53:50] Oh
[06:53:52] I get
[06:53:55] wdqs-heavy-queries is using the same IP block as wdqs
[06:54:02] and it's defining an icinga stanza as well
[06:54:09] and according to akosiaris that's not supported
[06:54:25] hmmm
[06:54:39] so it looks like you should drop the icinga stanza for the heavy-queries service
[06:54:54] gimmie some min to confirm
[06:55:04] sure :)
[06:56:49] heh, we had not introduced an LVS service reusing an IP in years and now we've done it in the course of 3 months like 5 times
[06:56:58] times change
[06:57:22] maybe it's about time I had a look at the mess of a data structure and try to rework it into something less sinister
[06:58:50] so I see IP reuse for search @9443, 9243 and 9643 and they all define icinga checks
[06:59:29] overall what will happen is currently undefined, maybe all icinga stanzas are the same?
[06:59:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/lvs/configuration.yaml#542
[07:00:58] * akosiaris looking
[07:04:16] onimisionipe: that behaves as expected on the icinga side?
[07:04:21] I see
[07:04:26] * vgutierrez checking as well
[07:04:28] vgutierrez: It doesn't apparently
[07:04:40] the last one overwrites the other ones
[07:04:40] doesn't look like it works
[07:04:48] :)
[07:04:51] 9243 does not exist for exactly
[07:05:03] 9243 check does not exist for that LVS service
[07:05:07] and I guess so on
[07:05:18] but 9263 exists
[07:05:31] Ok
[07:05:39] lol
[07:05:49] note btw that now we at least have stable ruby hashes
[07:05:49] I will fix this patch and submit a cleanup patch for search
[07:06:10] *search ones
[07:06:20] I am betting that back when puppet was powered by Ruby 1.8 that would change on every puppet run
[07:06:36] interesting
[07:06:49] not sure if we did the same for wdqs
[07:07:04] thanks akosiaris, vgutierrez !
[07:07:12] np
[07:07:33] and sorry for exposing you to that mess (it's mostly my mess btw)
[07:07:49] and it's biting us more and more recently so it's probably about time it's fixed
[07:08:23] no p. It's alright
[07:34:24] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/529053
[07:34:31] yup, I've seen PS16
[07:34:50] got lost with creating a check for that.. before I realized it should be in another patch
[07:36:36] so let's merge it :)
[07:36:48] alright!
[07:39:23] * vgutierrez disabling puppet on low-traffic LBs..
[07:39:29] done :), let's go
[07:41:28] CI is taking its time as usual
[07:41:37] alright!
[07:42:30] running puppet-merge...
[07:43:25] I'm gonna trigger a puppet run on the wdqs servers to let ferm update the FW rules
[07:46:01] hmm
[07:46:13] the ferm rule doesn't work as expected
[07:46:42] wdqs1004.eqiad.wmnet [10.64.0.17] 8888 (?) : Connection timed out
[07:46:43] ok
[07:47:01] that's lvs1016
[07:49:07] I can see the rule in wdqs1004
[07:51:09] onimisionipe: yep, but it doesn't look like the right IP is whitelisted as part of those rules
[07:51:55] 10.64.49.16 that's the main NIC for lvs1016 and is not included there
[07:53:38] I see 10.2.2.0/24 for eqiad
[07:54:08] is lvs1016 public or private?
[07:54:43] ignore
[07:54:46] it's private
[07:54:54] so the LVS has NICs in the public and private subnets
[07:55:33] @def $EQIAD_PRIVATE_PRIVATE1_LVS_EQIAD_IPV4 = (10.2.2.0/24); and 10.64.49.16... is the problem
[07:57:28] lvs1016 has a NIC on 10.2.2.1
[07:57:48] but of course is not going to use it to reach wdqs1004 cause that's sitting on 10.64.0.17
[07:57:55] yep
[07:58:28] got a quick fix or let's revert the patch and figure out the ferm rules?
[07:58:40] let's revert for now
[07:58:47] ack
[07:58:48] thanks
[07:58:51] np
[07:59:34] onimisionipe: so I'd say move the ferm stuff out of this patch and let's get that one merged and checked before the big one :)
[08:00:07] Ok
[08:01:21] I'm going to make ferm rule DOMAIN network for now
[08:48:46] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) @Cmjohnson should the following descriptions be updated as well with their `an-presto` equivalents?...
[08:53:53] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (MoritzMuehlenhoff) The old host definitions for cloudviran are still in debmonitor, puppetdb and site.pp are...
[09:03:05] Traffic, netops, Operations, IPv6, Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (MoritzMuehlenhoff) I just reimaged mw2231 for unrelated reasons (broken hardware, system got swapped with a different server) and the...
[10:34:32] Traffic, Operations: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (ema)
[10:34:39] Traffic, Operations: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (ema) p:Triage→Normal
[10:35:08] Traffic, Analytics, Analytics-Kanban, Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (ema) Open→Resolved >>! In T230772#5464981, @Gilles wrote: > This should probably be its own task, though, it's not specific to piwik.js Agreed, I've...
[10:35:18] Traffic, Analytics, Operations: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (ema)
[11:03:10] mmh interesting, for some reason the weight of cp1075 in etcd is 0. This means that it's currently getting no pass traffic from the frontends
[11:03:15] {"cp1075.eqiad.wmnet": {"pooled": "yes", "weight": 0}, "tags": "dc=eqiad,cluster=cache_text,service=ats-be"}
[11:03:42] we do have however weight:100 in yaml for ats-be, see conftool-data/service/cache.yaml
[11:04:14] I wonder how it happened that weight became 0?
[11:05:03] this is only an issue for frontend pass traffic, for which we use the random director
[11:05:23] the normal shard director does not use weight at all so no issue there
[11:05:46] anyways, setting weight back to 100
[14:06:12] ema: if I had to guess, maybe a misguided attempt to depool during some earlier issue?
[14:26:13] I had to depool cp1075 btw
[14:26:21] akosiaris: hey
[14:26:23] https://phabricator.wikimedia.org/P9072
[14:26:44] turns out the releases.discovery.wmnet cert does not have releases.wikimedia.org as a subjectaltname
[14:26:53] whoops, nice catch
[14:27:01] and I thought it best I do not mess with the certificate, but rather depool ats-be
[14:27:27] akosiaris: you can definitely update the cert, it should be easy with cergen
[14:27:45] ok, lemme rtfd myself a bit
[14:28:16] this is tried and true, to create a new cert from scratch: https://wikitech.wikimedia.org/wiki/Cergen#Cheatsheet
[14:28:26] ah, from scratch?
[14:28:31] updating should be as easy as modifying the yaml file
[14:28:38] and deleting the old files?
[14:28:42] or not?
[14:29:17] I think not: first update the SAN section in the yaml file, then run cergen -c 'SERVICENAME.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
[14:29:47] there's a caveat though:
[14:29:47] > NOTE: If you are regenerating a Puppet signed certificate, you must first remove the certificate from the Puppet CA. puppet cert clean should do it.
[14:29:58] ok, makes sense
[14:30:16] let's wreak havoc!
[14:30:17] * akosiaris doing
[14:30:30] * ema grabs the pop-corn
[14:32:10] 2019-09-10 14:31:40,206 WARNING Certificate(releases.discovery.wmnet) /srv/private/modules/secret/secrets/certificates/releases.discovery.wmnet/releases.discovery.wmnet.crt.pem exists, skipping generation.
[14:32:13] nope
[14:32:25] lemme delete just that though, I don't need to regenerate the key
[14:32:38] oh and the keystores I guess
[14:33:35] success!
[14:33:38] \o/
[14:33:51] what did you remove? Just the pem file?
[14:34:06] sudo rm releases.discovery.wmnet.crt.pem releases.discovery.wmnet.csr.pem releases.discovery.wmnet.keystore.jks releases.discovery.wmnet.keystore.p12 truststore.jks
[14:34:29] essentially left just the ca and the keys
[14:34:35] ok
[14:34:45] * akosiaris now updates puppet
[14:36:23] you might want to disable puppet on releases1001 and test the new cert on releases2001, though if cp1075 there's no TLS client talking to it anyways
[14:36:34] yeah that
[14:36:36] if cp1075 is depooled, I meant
[14:37:30] I love reviewing diffs of certificates
[14:37:42] aren't they great!
[14:39:22] also please don't forget to commit the changes on /srv/private
[14:39:40] done
[14:40:03] I also have to read up on envoy
[14:40:24] I left for like 3 weeks and I come back and there's a new component to care about
[14:40:34] :)
[14:40:50] we've had a pretty good experience with it so far
[14:41:26] the only issue is that it thinks the 90s are over and responded 426 to http/1.0 requests
[14:41:37] lol
[14:42:03] fixed saying "accept_http_10" in the config
[14:43:31] ok I did find a minor issue. I had to manually restart envoy
[14:43:52] it seems to work!
[14:44:00] yup, just verified it
[14:44:03] repooling cp1075
[14:44:10] ack
[14:44:33] ulsfo is being depooled for DC maintenance
[14:44:35] ok, great
[14:44:38] ema: thanks!
[14:44:39] you can follow what's going on on cp1075 with `sudo atslog-backend releases`
[14:44:55] and then see what happens to your requests
[14:45:05] oh that sounds interesting
[14:45:37] ah, that's nice!
[14:46:00] "releases" in the command above is a regex on the log, so for instance you can also say `atslog-backend RespStatus:[34]`
[14:48:14] alright, it seems that things are working again, thanks a lot for spotting this issue! It showed up only now because varnish-fe pass traffic was wrongly not being routed to cp1075 till today at 11:07
[14:50:26] glad to be of help
[14:55:17] akosiaris: does this more or less reflect what you just did? https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate
[14:58:18] ema: I've rephrased to be in imperative mode, but otherwise LGTM
[14:58:24] nice!
[14:59:10] btw, ru.planet.wikimedia.org seems to have the same issue, per ats's logs
[14:59:22] there is also wikipedia.matt.life or something which is ultra weird
[14:59:38] some VM on digitalocean, but anyway
[15:00:50] I doubt the latter is actually ours? :)
[15:01:00] yeah clearly, just spotted it in the logs
[15:01:22] some Caddy server proxying to us, for reasons unknown
[15:02:22] We had a bug report last week about someone using caddy as a proxy
[15:02:35] akosiaris: ah while you were on vacation we also switched from apache to Caddy everywhere, not sure if you've seen that yet
[15:02:40] ema: ahahahaha
[15:03:45] https://phabricator.wikimedia.org/T232188 Reedy?
[15:04:01] That and https://phabricator.wikimedia.org/T232213 yeah
[15:26:28] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` ['an-p...
[15:34:42] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[15:34:46] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirtan1002.eqiad.wmnet'] ` Of which those **FA...
[15:36:04] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[15:44:45] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[16:05:24] Traffic, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (Reedy)
[16:06:15] Traffic, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previes/diffs - https://phabricator.wikimedia.org/T232491 (Daimona)
[16:06:56] Traffic, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Reedy)
[16:40:57] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1002.eqiad.wmnet'] ` Of which those **FAI...
[16:45:45] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1001.eqiad.wmnet'] ` Of which those **FAI...
[20:50:04] Traffic, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Ahecht) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Cannot_save_edit_on_pages_when_using_cellular_network_with_...
[20:58:10] Traffic, netops, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Ahecht)
[22:14:13] Traffic, netops, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Izno) I have issues saving in Chrome (Win10) on my work computer but no issue at home on Firefox (Win10). I use WTE2017....
[22:34:16] Traffic, netops, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Addshore)
[22:38:51] Traffic, netops, Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (ayounsi) Thanks for the reports, we have narrowed down the cause to a [[ https://en.wikipedia.org/wiki/Maximum_transmiss...
[23:17:28] Traffic, Operations, Wikimedia-Logstash, observability, User-Addshore: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fiel...