[09:56:49] nice.. cp3035 has scheduled a varnish-backend-restart in 8 hours
[09:57:28] and cp3039 in 2 hours and a half
[10:01:08] clearly we're slightly short on timing, our 3.5d or whatever it was doesn't work anymore, it seems
[10:01:50] <_joe_> we've an alert on 5xx in esams incoming
[10:02:16] <_joe_> a relatively small spike
[10:03:50] varnish-be restarted on cp3039
[10:04:53] looks like there were slowly-increasing 5xx earlier today, although the latest spike at 9:56 looks like the recurrent slow query db
[10:05:18] <_joe_> godog: how could you tell?
[10:05:24] <_joe_> I saw no alerts from pybal
[10:05:43] <_joe_> those are also not localized in one dc
[10:05:55] the earlier 8:20 -> 9:20 "mountain of 5xx" https://logstash.wikimedia.org/goto/46c360757cddc1ab4fee57b9a320c588
[10:06:07] let me check the slow queries on enwiki
[10:06:21] vs the latest short spike https://logstash.wikimedia.org/goto/5472247b271961c2bbe812804db646b5
[10:06:48] _joe_: I'm looking at x-cache table
[10:06:59] also uri_host histogram
[10:07:05] <_joe_> godog: yeah the latest spike is not only wikipedia
[10:07:22] we had spikes at 8:20 and at 9:18 and at 09:54
[10:07:46] so cp3035 and cp3039 are upload nodes
[10:07:46] <_joe_> the spike is at 9:56 in 5xx
[10:08:02] <_joe_> ok, they won't be the cause of the 5xx on text then
[10:08:30] <_joe_> oh https://en.wikipedia.org/wiki/User:Acer/Simple1 times out
[10:08:44] <_joe_> the most common page in that span earlier
[10:08:51] indeed, it times out for me too
[10:08:58] <_joe_> PHP fatal error:
[10:09:00] <_joe_> entire web request took longer than 60 seconds and timed out
[10:09:30] <_joe_> now, let's see if mwdebug with php7 works
[10:09:47] <_joe_> but I'm ready to bet that page is just impossible to parse in 60 seconds
[10:09:59] <_joe_> probably it was taken over the limit
[10:10:07] yeah that acer/simple1 url I've seen it in the past on the 5xx dashboard, not new
[10:10:11] FWIW that is
[10:10:20] <_joe_> and, under php7 it renders
[10:10:22] <_joe_> hah
[10:10:35] <_joe_> so it times out on HHVM but not under php7 I guess
[10:11:02] <_joe_> 50k elements list
[10:11:09] :/
[10:11:18] I have to go run an errand and lunch, although I have my laptop with me for emergencies
[10:12:48] <_joe_> 1,039,247 bytes
[12:32:22] <_joe_> huge spike on upload
[12:33:02] <_joe_> vgutierrez: ^^
[12:33:29] yeah... cp3035 I guess
[12:33:45] <_joe_> yep
[12:33:54] <_joe_> can you restart it? I was going to lunch
[12:33:59] cp3035 has the mailbox lag but it is not the one reporting failed fetches
[12:34:08] hmm
[12:34:23] although it is the one maxing out on connections to multiple cp1*
[12:34:51] <_joe_> this is upload, so nothing to do with the applayer/dbs
[12:35:06] yeah restarting it
[12:35:19] * vgutierrez stands by
[12:36:09] we're close to getting rid of upload varnish-backends anyways
[12:36:23] should try to squeeze in some time to convert a few more in esams while ema's away
[12:36:34] I think cp3039 is also starting to get a stomachache
[12:36:49] hmmm again?
[12:36:51] yeah the whole pattern of this is odd
[12:37:09] https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?orgId=1
[12:37:31] two nodes taking off on an mbox lag ramp close to the same time, recurrence on a restarted node, etc
[12:37:41] cp3039 is not showing mailbox lag right now but it is maxing out on connections to some cp1* nodes
[12:37:43] while upload is prone to the real storage scaling issue, this might not be it
[12:37:45] it is interesting
[12:38:02] could just be an abnormal miss rate from strange traffic
[12:38:09] * cdanis looking at https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend&from=now-3h&to=now
[12:38:49] I'm assuming cause cp3039 got restarted again due to the cronjob
[12:38:53] IIRC the cron line
[12:38:56] * vgutierrez rechecking
[12:39:18] 29 12 * * 3 root /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart | /usr/bin/logger -t varnish-backend-restart
[12:39:22] right, 10 minutes ago
[12:45:53] yeah there is an unusual amount of misses
[12:46:31] but that can just be all the restarts, too
[12:49:20] anyways, I'm going to kick off another reimage there and see how painful the initial puppeting is
[15:34:21] godog: yt?
[15:34:30] q about swift, want to try to use swift cli to upload something
[15:39:34] ottomata: yup, shoot!
[15:39:43] so, i think maybe i just got something
[15:39:58] i was trying to use the creds i see in deployment-ms-fe03 account_AUTH_phab.env
[15:40:10] since it has e.g. a dummy key
[15:40:11] just for testing
[15:40:24] i was trying to use the swift CLI flags to
[15:40:27] do that
[15:40:35] but was failing
[15:40:46] then tried swift auth, and saw that it respected some env vars
[15:40:49] so i set some
[15:40:55] and I think I uploaded something!
[15:41:05] are the env vars the preferred way to use swift cli?
[15:41:18] (btw, I found swift client in python-swiftclient package
[15:41:20] )
[15:41:41] i'm exporting e.g. OS_STORAGE_URL and OS_AUTH_TOKEN
[15:41:50] which I got from running swift auth
[15:41:53] yeah the easiest is to source the .env files and use the swift client
[15:42:00] with ST_AUTH, ST_USER, and ST_KEY set
[15:42:02] it'll pick up those variables
[15:42:20] would it always be two commands then?
[15:42:25] first export the ST_* vars
[15:42:37] then e.g. eval swift auth
[15:42:43] to export OS_* vars?
[15:44:56] I think I lost you on the OS_ vars, after sourcing the .env file then you can use 'swift' command line client as-is, no other env variables required
[15:45:24] oh hm
[15:45:31] ok maybe that step wasn't necessary
[15:47:08] yeah it shouldn't
[16:25:03] we have a super-busy queue in PCC today :-/
[17:55:42] all the openstack clients use environment variables for auth details in that manner
[18:00:00] Krenair: do you know...once i've uploaded an object to a container, what is the url to get it?
[18:00:19] i know the ST_AUTH I'm using is http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0
[18:00:28] looking in docs but not sure yet...
[18:00:52] https://developer.openstack.org/api-ref/object-store/?expanded=get-object-content-and-metadata-detail
[18:01:36] oh ok
[18:01:40] what's account there?
[18:02:39] I think that identifies a tenant/project, which I don't think we use
[18:02:50] hm
[18:04:47] Oh
[18:04:48] Krenair: i think
[18:04:51] AUTH_phab
[18:04:52] in this case
[18:05:03] and, i can set X-Auth-Token
[18:05:08] with value given to me by
[18:05:11] swift auth
[18:05:12] command
[18:06:13] DEBUG:swiftclient:REQ: curl -i http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/v1/AUTH_mw/wikipedia-en-timeline-render/33b7ccf43e66e16dafe06b8e65ea0e2b.err -X GET -H "X-Auth-Token: AUTH_redacted"
[18:06:40] this is what `swift download wikipedia-en-timeline-render 33b7ccf43e66e16dafe06b8e65ea0e2b.err --debug` does
[18:06:40] yeahhhh
[18:06:42] thanks
[18:06:43] !
[18:07:05] though I'm using mw credentials so idk what stuff you'll have access to
[18:07:20] yeah, i found a dummy 'phab' account in deployment-prep
[18:07:21] that i'm using
[18:07:26] and have been uploading to it
[21:02:17] a while ago i made a ticket to build/add a grafana package for stretch. checking again now i notice at some point it was imported because there is one now. though.. surprised that the stretch version is lower than the jessie version (5.4.2 vs 6.1.3)
[21:02:58] interested because i want to upgrade krypton (webserver_misc_apps) to stretch and that includes grafana among others
[21:05:34] side note: i can find emails _when_ something was done with reprepro but you can't tell who did it because it's always root
[21:28:38] krypton no longer includes grafana, that's now running on grafana1001
[21:29:09] I wasn't aware anyone had built or imported grafana6.x packages; I've had it on my backlog to do that for stretch/buster for some weeks now
[21:29:42] we might have grafana still installed on krypton, but it is not used there as moritz says
[21:58:01] moritzm: oooh. well i did not notice. this makes it way easier :)
[21:58:39] cdanis: well, yes, just that jessie has 6.x and stretch does not yet
[22:19:33] ever noticed how text formatting that works in phab comments does NOT work in the task description itself? like strikethrough with ~~
[22:19:57] i thought i got the syntax wrong but it's just because it's in the description field
[22:22:21] cdanis: if you want to follow-up on getting 6.x in stretch as well.. feel free to close https://phabricator.wikimedia.org/T210034
[22:22:44] unlinks the parent task because it doesn't affect krypton anymore
[22:50:41] out for a bit since there will be more work after 6
[23:32:00] grafana explore tab is awesome, shows apt updates from prometheus!
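
Note on the Swift exchange in the log: the debug line suggests that an object's GET URL is the storage URL (ending in the account, e.g. AUTH_mw or AUTH_phab) followed by the container and the object name, with the token from `swift auth` sent as X-Auth-Token. A minimal Python sketch of that layout follows. The helper names `object_url` and `auth_headers` are made up for illustration, and it assumes the storage URL is already known (for instance from the OS_STORAGE_URL value that `swift auth` prints); it is not how python-swiftclient itself builds requests.

```python
from urllib.parse import quote


def object_url(storage_url: str, container: str, obj: str) -> str:
    """Join a Swift storage URL, a container and an object name (hypothetical helper)."""
    # Object names may contain "/" (pseudo-directories), so "/" is left unescaped there.
    return "/".join([
        storage_url.rstrip("/"),
        quote(container, safe=""),
        quote(obj, safe="/"),
    ])


def auth_headers(token: str) -> dict:
    """Header carrying the token printed by `swift auth` (hypothetical helper)."""
    return {"X-Auth-Token": token}


# Values copied from the debug line in the log; no real token is used here.
url = object_url(
    "http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/v1/AUTH_mw",
    "wikipedia-en-timeline-render",
    "33b7ccf43e66e16dafe06b8e65ea0e2b.err",
)
print(url)
```

For the dummy account mentioned in the log, the storage URL would presumably end in AUTH_phab instead of AUTH_mw; that is an inference from the conversation, not something the log confirms.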