[08:05:34] do we have any way to get the main IP of an instance on hiera?
[08:06:16] without needing to specify the FQDN of that instance of course
[08:08:52] right.. $::ipaddress :)
[08:16:23] <_joe_> vgutierrez: what do you want to do?
[08:16:28] <_joe_> I sense an XY problem
[08:16:37] just https://gerrit.wikimedia.org/r/c/operations/puppet/+/539056
[08:16:47] and PCC is happy https://puppet-compiler.wmflabs.org/compiler1001/18554/
[08:17:56] basically what we already do in https://github.com/wikimedia/puppet/blob/production/modules/tlsproxy/templates/localssl.erb#L5
[08:18:07] but without hardcoding the IP fact on the template
[08:21:49] _joe_: looks good to you?
[08:22:34] <_joe_> yes
[08:22:50] nice :)
[08:22:58] <_joe_> out of curiosity - why do we use the NIC vs loopback?
[08:23:21] :)
[08:23:44] if varnish gets the connection via the loopback it adds "127.0.0.1" to X-Forwarded-For
[08:23:55] <_joe_> ok
[08:23:58] if it gets the connection via 10.x it adds 10.x to X-Forwarded-For
[08:24:01] <_joe_> and you want the ipaddress instead
[08:24:06] <_joe_> ok
[08:24:22] <_joe_> I'm not 100% sure of the implications
[08:24:28] and for some reason mw and some parts of our VCLs need the IP instead of 127.0.0.1
[08:24:48] <_joe_> yeah ok, I'm thinking in terms of the system
[08:25:12] so... it shouldn't be a problem... the local IP should be mapped in the route table as localhost
[08:25:23] <_joe_> yeah that's what I was thinking
[08:25:49] <_joe_> so you just want to mark it as coming from the ipaddress, but it won't really affect what actually happens
[08:26:14] indeed
[08:26:27] I just need it to make varnish/mw happy
[08:27:01] <_joe_> but I'm not 100% sure about it, there might be non-obvious consequences I'm not aware of
[08:27:17] <_joe_> anyways, that's what we already do, it's fine of course
[08:27:22] yep
[08:27:23] <_joe_> I was just curious
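A minimal Puppet sketch of the pattern being discussed: pass the address in as a class parameter that defaults to the primary-interface fact (the structured counterpart of the $::ipaddress mentioned above), so hiera can override it per host or role and the ERB template no longer reads the fact directly. The class, parameter, and file names below are invented for illustration and are not the ones in the actual Gerrit change.

    # Sketch only -- placeholder names, not the real profile/module.
    class profile::example_tls_termination (
        # defaults to the host's primary IP; $facts['networking']['ip'] is the
        # structured equivalent of the legacy $::ipaddress fact
        Stdlib::IP::Address $upstream_address = $facts['networking']['ip'],
    ) {
        # the ERB template then only references the parameter, e.g.
        #   proxy_pass http://<%= @upstream_address %>:8080;
        # so varnish sees the NIC address, not 127.0.0.1, in X-Forwarded-For
        file { '/etc/nginx/sites-available/example-localssl':
            ensure  => file,
            content => template('profile/example-localssl.nginx.erb'),
        }
    }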
[11:56:53] _joe_: ok first question, has anyone ever tested restoring an etcd backup?
[12:15:52] <_joe_> so, I just noticed in the logs
[12:16:13] <_joe_> we have a slowlog of requests to php-fpm exceeding 15 seconds
[12:16:22] <_joe_> there aren't many around the time of the incident
[12:16:38] I have noticed something else as well, although it is a bit irrelevant to the issue
[12:17:43] https://logstash.wikimedia.org/goto/dc310498068252a9015d363487ba3028
[12:18:29] <_joe_> effie: what should we look for there?
[12:18:41] I think it is an error in logging
[12:18:45] if you open each one
[12:18:50] there are different errors
[12:18:51] _joe_: does the lack of slowlogs suggest that something was deadlocked at the php level vs at the application code level?
[12:19:11] <_joe_> cdanis: I would assume the slowness was in one backend
[12:19:15] but as I said, it is not relevant to this
[12:19:29] <_joe_> that would stall the php-fpm workers sitting idle waiting for a response
[12:19:35] <_joe_> hence using little cpu power
[12:19:37] but wouldn't a backend being slow mean that they got slowlogged?
[12:20:03] <_joe_> not if it's just slow enough
[12:20:12] <_joe_> e.g. a query that returned in 1 ms returns in 1 s
[12:20:24] oh sure
[12:20:30] Allowed memory size of 692060160 bytes exhausted (tried to allocate 1052672 bytes)
[12:20:40] there are a lot of those there effie
[12:21:06] since when though
[12:21:10] ... with different stack traces ...
[12:21:10] <_joe_> define a lot?
[12:21:20] I started noticing about 2 weeks ago
[12:21:31] _joe_: the freq is not a problem
[12:21:37] there are 35 per hour on this one
[12:21:44] the issue is the format
[12:22:00] <_joe_> are we still talking about the logging issue?
[12:22:01] who should I tell
[12:22:05] I think yes
[12:22:08] I am
[12:22:11] <_joe_> ok, I was focusing on the outage
[12:22:27] effie: godog/herron/shdubsh
[12:22:35] ok tx
[12:23:02] <_joe_> cdanis: to answer your question further
[12:23:05] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=1569413201023&to=1569414093197&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1276
[12:23:11] <_joe_> I don't see anything standing out
[12:23:20] <_joe_> besides the reduction in load average maybe
[12:23:57] <_joe_> and the dip in the req/s as you'd expect when there is a slowdown
[12:24:16] slight dip in network activity
[12:24:22] which you would also expect
[12:24:33] <_joe_> while you see php-fpm maxing out its children
[12:25:00] and the php-fpm 'requests' metric is missing for the interval
[12:25:28] <_joe_> cdanis: possibly because of the overload and having all workers busy
[12:25:42] <_joe_> sadly the monitoring request is a request to the same pool as the normal traffic
[12:25:46] yeah
[12:25:53] apcu entries also weirdly flatlined for the period
[12:25:53] <_joe_> only way to collect data that's relevant
[12:26:04] mostly-flatlined -- slightly different data points
[12:26:37] <_joe_> actually there is a fluctuation in activity, again explained by the response time fluctuations
[12:26:47] <_joe_> so one thing I would exclude is a php-fpm level issue
[12:26:58] <_joe_> that wouldn't manifest all at the same time on all servers
[12:27:16] <_joe_> which is what I have seen so far
[12:27:46] <_joe_> so either it's a code path that generates some bug, or I'd expect the problem to be in memcached or the databases
[12:28:21] <_joe_> marostegui: anything particular during the interval 12:05 - 12:10 on the databases?
[12:28:49] _joe_: Let me check some graphs
[12:29:02] memcache shows an increase in sets and evictions after the event
[12:29:39] _joe_: A spike in connections and a reduction in queries per second
[12:29:47] Nothing relevant on logstash
[12:29:59] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1
[12:29:59] <_joe_> marostegui: on all sections?
[12:30:06] yep
[12:30:23] that is weird
[12:30:24] <_joe_> ok that doesn't tell us much tbh
[12:30:35] <_joe_> no it's not
[12:31:06] I would expect that a general issue affects all sections, so not that weird
[12:31:07] sigh ok yes, I spoke too fast
[12:31:12] <_joe_> say there is one specific query that is used by all requests to a decently sized wiki that for some reason returns in 1 second instead of a few milliseconds
[12:31:14] if API was not sending requests
[12:31:36] <_joe_> then we will have most workers waiting for that query to return
[12:31:39] <_joe_> but
[12:31:50] <_joe_> if it's across all sections that the connections went up
[12:32:24] <_joe_> yeah almost all
[12:32:38] <_joe_> this means we just had open connections not doing much
[12:32:45] <_joe_> next candidate is memcached
[12:33:22] <_joe_> there is one issue with memcached though - it's hard to imagine an issue hitting the api and not the appservers
[12:33:32] <_joe_> also I can check the mcrouter logs
[12:34:27] <_joe_> no increase in errors on the mediawiki side
[12:34:35] <_joe_> nor on the mcrouter side
[12:34:42] are there any rarely-used backends, ones that only get called in a certain query type or something?
[12:35:04] <_joe_> no, the data are spread with consistent hashing across 18 backends
[12:35:23] <_joe_> so one thing I didn't check are the parsercaches and external storage
[12:35:33] <_joe_> but those would affect the appservers as well
[12:36:54] <_joe_> can someone dig up the relevant section of api.log on mwlog1001 and see if anything stands out?
[12:37:34] I can have a go
[12:37:40] it is not that long ago
[12:41:19] <_joe_> I'm 99% sure this is a recurrence of https://phabricator.wikimedia.org/T192262
[12:44:06] you think php7 would not have the CPU issue _joe_?
[12:44:23] <_joe_> cdanis: possibly!
[12:46:16] <_joe_> cdanis: I am basing this deduction solely on two data points: the % of those requests vs other requests logged during the outage was up to 10% from 6% of the total; and those were the slowest of all requests to return a response during the outage
[12:47:29] <_joe_> but it might be very wrong
[12:48:09] <_joe_> also the issue began in the same second on various servers
[12:48:12] <_joe_> 2019-09-25T12:05:36
[12:48:22] <_joe_> response times start to spike then
[12:48:48] <_joe_> and keep climbing for some time
[12:54:03] <_joe_> we don't have more info because requests stayed under 8 seconds
[12:54:14] <_joe_> and we create stack traces only above 15
[12:54:25] <_joe_> we might want to lower that limit progressively
[12:54:34] <_joe_> so that we can see if we gather useful information
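To make the last point concrete: php-fpm's slow-request stack traces are driven by the request_slowlog_timeout and slowlog pool directives, so lowering the limit progressively is a one-parameter change. The sketch below is not the production profile; the class name, drop-in path, pool name, and service name are assumptions, and only the two php-fpm directives are real.

    # Hypothetical sketch: expose the slowlog threshold so it can be lowered
    # step by step (e.g. 15 -> 12 -> 10) from hiera and rolled back easily.
    class profile::example_phpfpm_slowlog (
        Integer $slowlog_timeout_seconds = 15,
    ) {
        service { 'php7.2-fpm':
            ensure => running,
        }

        # request_slowlog_timeout / slowlog are standard php-fpm pool
        # directives; the drop-in path and pool name are made up here.
        file { '/etc/php/7.2/fpm/pool.d/zz-slowlog.conf':
            ensure  => file,
            notify  => Service['php7.2-fpm'],
            content => @("CONF"),
                ; dump a stack trace for any request slower than the threshold
                [www]
                request_slowlog_timeout = ${slowlog_timeout_seconds}s
                slowlog = /var/log/php7.2-fpm/www-slow.log
                | CONF
        }
    }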
[13:24:04] I am sure there is a task somewhere for parsing those logs and keeping them for a couple of days in a searchable format
[13:24:17] I am painfully going through api.log
[13:31:42] <_joe_> effie: the best way to do it is to fgrep an interval of time to a smaller file
[13:31:49] <_joe_> I'd suggest not more than 10 minutes
[13:32:05] I already have a smaller file
[13:32:18] <_joe_> then it's pretty searchable with awk / grep :P
[13:32:34] <_joe_> now, finding relevant info
[13:32:38] <_joe_> that's another story
[13:32:42] I am there as well
[13:33:04] I am sorting times on each sec
[13:33:11] and tailing the 10 worst
[13:33:11] <_joe_> it might be useful to have a pre-cooked script outputting a few commonly needed stats from a file containing those logs?
maybe
[13:33:46] so now I have the worst 10 times of each second (not done yet for the 12:05-12:10 interval)
[13:34:13] so then I can find the log info by searching $timems
[13:34:24] at least that is what I came up with
[13:34:54] e.g.
[13:34:55] 1971 2331 2473 2762 2799 2983 2994 5920 6965 49717
[14:16:58] apergos: is 'deployment-dumps-puppetmaster02' yours? I'm looking for VMs that I can move without disrupting anyone with downtime.
[14:17:06] it is
[14:17:09] yes you may move it
[14:17:21] also snapshot01
[14:17:44] thanks!
[18:14:53] is there a thing in puppet like require_packages but the opposite?
[18:16:23] chaomodus: i haven't seen one. recently i replaced a require_package with a normal package{} but with ensure => absent for that
[18:16:45] ah
[18:16:47] interesting
[18:16:54] is there any downside to ensure => present for packages?
[18:18:01] it'd be cool to have ensure_packages that works like require_packages but takes an ensure argument
[18:18:08] chaomodus: the difference is require_package installs them before evaluating any other resources
[18:18:17] ah
[18:18:18] so it depends. in some cases you can get dependency issues
[18:19:05] yeah I've done ensure=>present in a package resource, sometimes with the package resource having a dependency on an apt repository resource... and then usually with several other resources having a dependency on the package
[18:19:09] I think also require_packages doesn't fail if you have several modules installing the same packages on the same host
[18:19:12] "equivalent to declaring and requiring the package resources"
[18:21:01] hrm, interesting
[18:21:56] there is also ensure_resource in stdlib
[18:22:08] maybe that with type package and absent?
[18:22:13] bbiaw.. train arrived
[18:22:45] oh that's cool
[18:22:49] ensure_resource looks good, yeah
[18:28:57] oh there's also ensure_packages :)
[18:31:39] which uses ensure_resource so it has the same semantics in terms of passing an ensure parameter
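A short Puppet comparison of the options that came up in this exchange. Package names are placeholders; require_package() is the in-house helper being discussed, while ensure_packages() and ensure_resource() come from stdlib and both accept an attribute hash, which is how the "absent" case is covered.

    # require_package('some-tool') only installs (and orders the package
    # before other resources); for removal, one of these applies instead:

    # 1. plain resource -- simplest, but a second module declaring the same
    #    package title causes a duplicate-declaration compile error
    package { 'example-obsolete-tool':
        ensure => absent,
    }

    # 2. stdlib ensure_packages(): takes an attribute hash, so it can also
    #    ensure absence, and only declares packages not already in the catalog
    ensure_packages(['example-legacy-daemon'], { 'ensure' => 'absent' })

    # 3. stdlib ensure_resource(), which ensure_packages() is built on
    ensure_resource('package', 'example-old-library', { 'ensure' => 'absent' })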
[19:06:13] _joe_: this is turning into a lot of yak shaving
[19:22:19] <_joe_> cdanis: this what?
[19:22:39] I'm now up to "the schema files between puppet and what's used in unit tests really shouldn't be different"
[19:22:44] re: conftool
[19:23:07] <_joe_> oh sure they shouldn't be for dbctl
[19:23:15] <_joe_> for conftool per se, it's ok?
[19:23:21] yeah that's what I mean
[19:23:28] <_joe_> yeah :P
[19:23:33] <_joe_> but well
[19:23:58] <_joe_> you can have tox download the latest version of the schema from gitiles :}
[19:24:20] hah
[19:24:26] i am too used to the nice parts of a monorepo
[19:24:35] <_joe_> yes :)
[19:24:45] (there certainly are nice parts)
[19:24:47] <_joe_> but hey, at least we don't use perforce
[19:24:53] neither does Google ;)
[19:25:03] <_joe_> it used to, though, right?
[19:25:06] oh yes
[19:25:40] and Piper, by design, looks quite Perforce-y from an API perspective
[20:46:50] _joe_: what do dependencies between kvobjects do anyway? was it just the syncer that cared?
[20:47:50] <_joe_> err I have a patch to make them work
[20:47:55] <_joe_> like, properly
[20:48:01] <_joe_> but then never found a use for them
[20:50:19] should they just be removed from the code, then?
[20:50:26] in either case we'll have to edit the schema for discovery objects
[21:04:12] I don't think conftool actually works on jessie btw :P
[22:02:54] <_joe_> no?
[22:03:03] <_joe_> it surely does on scb*