[00:00:32] Also, the thing I want to try on Monday is moving from 1 container doing everything to one doing the git and zuul stuff and one doing the testing, but we will see
[00:00:50] I'll trial it, and as long as it doesn't add too much extra time
[00:01:37] addshore: I hadn't thought about tagging patterns. The initial image was easy since I just named it after the job that it was supplanting. Going forward it might make sense to name roughly the way that jjb names things. That is, least specific parameter to most specific: trusty-php7-etc?
[00:03:00] I meant more as in the 0.1 etc
[00:03:38] Versions of the image
[00:04:31] ah, yeah, versioning. That I do have some better opinions about. I started with semver versions, but that doesn't make a lot of sense. I think date versioning might be better and I am also against using "latest" since it's harder to control versions being pulled.
[00:05:21] Yup, my thoughts exactly :)
[00:05:25] Sweet
[00:05:36] cool :)
[00:05:47] Right, actually going to bed now
[00:06:22] heh, later
[00:07:03] I'm going to make some quick edits to that docker page and then peel myself away from the computer for a good while.
[00:07:18] thcipriani: it seems to be choking....again?
[00:08:31] thcipriani: is there a way to clear out the backend zuul queue so that nodepool doesn't go after 135 instances right away?
[00:08:54] because it seems to have dug itself a hole again
[00:08:55] oddly
[00:09:39] hrm, zuul queue is kept in gearman. Let me dig in docs about it. I remember some fairly ominous warnings about messing with it...
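A note on the image-versioning exchange above (date versions instead of semver, and no "latest"): the idea could be sketched roughly as below. This is only an illustration. The `YYYY.MM.DD` layout, the `-N` suffix for same-day rebuilds, and the helper names `image_tag` and `is_pinned` are assumptions, not anything agreed in the channel; `trusty-php7` is borrowed from the naming example in the discussion.

```python
from datetime import date


def image_tag(name: str, build_date: date, rebuild: int = 0) -> str:
    """Build a date-versioned image reference such as 'name:2017.09.09'.

    The date layout and the '-N' rebuild suffix are assumptions made
    for this sketch.
    """
    tag = build_date.strftime("%Y.%m.%d")
    if rebuild:
        tag = f"{tag}-{rebuild}"
    return f"{name}:{tag}"


def is_pinned(ref: str) -> bool:
    """Return True if the reference carries an explicit tag other than 'latest'."""
    _, sep, tag = ref.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port,
    # so the reference has no tag at all.
    return bool(sep) and "/" not in tag and tag != "latest"


if __name__ == "__main__":
    ref = image_tag("trusty-php7", date(2017, 9, 9))
    print(ref, is_pinned(ref))
    print("trusty-php7:latest", is_pinned("trusty-php7:latest"))
```

Refusing unpinned references in job definitions is one way to keep control over which image version gets pulled, which was the stated objection to "latest".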
[00:09:55] sure
[00:12:18] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0]
[00:12:24] well, wtf, nodepool is sure not logging anything at all
[00:13:02] ah, maybe just rotated logs
[00:13:52] thcipriani: I'm not sure what the play is here atm, it seems like catching up w/ the backlog is melting things even after we stabilize
[00:14:37] thcipriani: no choice but to stop nodepool
[00:14:48] chasemp: ok
[00:17:19] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[00:17:35] well. we could kill zuul, but that just leaves all these patches in flight stranded in an undefined state.
[00:17:58] which is not a big deal for the test queue, the merge queue would be hectic.
[00:18:13] also, not sure if that fixes the underlying problem.
[00:18:32] like: if more patches start coming in will that just cause this same issue?
[00:20:57] thcipriani: I'm looking at rate
[00:21:04] thcipriani: we are talking about throttling nodepool way down to like 5 instances to keep it from thundering
[00:21:07] I'm thinking I'm going to clear things out, depending on how they respond
[00:21:21] that's part of why I stopped it so quickly to see if we could catch it before things were dire
[00:23:23] I think it may be a timing issue, like nodepool will keep making requests until it has enough servers in a ready state, I'm not sure there's a specific number that'll make it happy. I think fiddling with the "rate" in the nodepool.yaml may be a better knob to turn.
[00:23:37] did that number increase recently?
[00:23:48] thcipriani: yeah. chasemp is schooling me on that
[00:25:14] chasemp: the rate in /etc/nodepool/nodepool.yaml is 4 seconds (?) which is not what it *should* be? I think it should be 8 seconds...?
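For readers following the "rate" discussion: the Nodepool configuration reference describes `rate` as the number of seconds to wait between operations on the provider (default 1.0), so a larger value means fewer API calls per minute. The sketch below only illustrates that behaviour and is not Nodepool's own implementation; `ProviderThrottle` and `min_seconds_for` are hypothetical names made up for the example.

```python
import time


class ProviderThrottle:
    """Space out provider operations by at least `rate` seconds.

    Illustrative only: a hypothetical helper, not Nodepool code.
    """

    def __init__(self, rate=1.0, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous operation was released

    def wait(self):
        """Block until `rate` seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.rate - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def min_seconds_for(operations, rate):
    """Lower bound on wall time for a run of throttled operations."""
    return max(operations - 1, 0) * rate


if __name__ == "__main__":
    throttle = ProviderThrottle(rate=0.1)  # small value so the demo is quick
    for op in ("list flavors", "list extensions", "create server"):
        throttle.wait()
        print(op)
    print(min_seconds_for(25, 20.0), "seconds at least for 25 calls at rate 20")
```

Under this model a higher `rate` turns a burst of requests into a steady trickle, at the cost of slower recovery of the node pool.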
[00:25:25] thcipriani: it's not been restarted
[00:25:42] I stopped puppet and put 4 in there and I'm thinking about letting it go once things settle to see if we can stabilize at 4
[00:25:52] thcipriani: 4 seconds?
[00:26:19] I thought that was instruction rate
[00:26:21] not seconds
[00:26:25] i.e. instructions issued per n
[00:26:32] maybe that's not right
[00:27:19] > In seconds, amount to wait between operations on the provider. Defaults to 1.0
[00:27:38] ok
[00:27:42] so higher is slower
[00:27:44] ok
[00:27:47] (via https://docs.openstack.org/infra/nodepool/configuration.html)
[00:28:09] yeah, so maybe jacking that up for a while until it slowly recovers would be good
[00:28:30] let's try it
[00:28:38] good/cause nodepool not to flail all over openstack until it breaks
[00:37:38] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[00:37:53] thcipriani: I kicked off the fun
[00:38:03] I saw
[00:38:24] thcipriani: seemed like 30m or so from calm to melt before
[00:38:27] we think
[00:41:02] Continuous-Integration-Infrastructure, Release-Engineering-Team (Watching / External), Cloud-VPS, Nodepool, and 2 others: figure out if nodepool is overwhelming rabbitmq and/or nova - https://phabricator.wikimedia.org/T170492#3433128 (bd808) I ran `sudo rabbitmqctl purge_queue` for the ceilometer...
[00:45:39] so it may take a while before it actually starts building anything with 1 instruction every 20 seconds.
It listed flavors, then listed extensions, now waiting 20 seconds before sending another command (I think)
[00:46:25] ahhhhh
[00:46:26] right
[00:47:13] thcipriani: my thinking was make it a trickle to see it even not crap itself for awhile
[00:48:08] yeah, I think that's fine, more just a heads-up than anything: we'll be waiting a bit before we see anything hit nova
[00:49:33] ah ha, first create server task
[00:50:22] :)
[00:53:19] thcipriani: 2017-09-10 00:53:00,142 DEBUG nodepool.NodeUpdateListener: Unable to find node with nodename: deployment-tin.eqiad ?
[00:53:42] 2017-09-10 00:53:00,139 DEBUG nodepool.NodeUpdateListener: Received: onStarted {"name":"beta-code-update-eqiad","url":"job/beta-code-update-eqiad/","build":{"full_url":"https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/171978/","number":171978,"phase":"STARTED","url":"job/beta-code-update-eqiad/171978/","node_name":"deployment-tin.eqiad","node_description":"for beta project","host_name":"contint1001.wik
[00:53:42] imedia.org"}}
[00:53:58] yeah, I see that flash through the log whenever beta-scap-eqiad is triggered
[01:03:10] well, seems to have made it through its first round of create 25, use 25, delete 25
[01:03:19] now building again
[01:32:28] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[01:44:49] !log nodepool running steadily again, but has been heavily throttled to hopefully prevent another weekend thundering herd of doom failure for the OpenStack backend
[01:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[01:46:04] email sent to releng list re ^ as well
[02:24:34] PROBLEM - Puppet staleness on deployment-kafka01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[05:37:51] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 40.00% of data above the
critical threshold [10.0]
[05:52:51] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[06:39:35] MediaWiki-Codesniffer: Be aware of extension's minimum MediaWiki version supported - https://phabricator.wikimedia.org/T175465#3594180 (Legoktm)
[07:37:40] (CR) Legoktm: [C: 1] "This seems fine to me. I'll +2 in a few days if no one else has comments? Should we consider validating "@deprecated since $version"?" [tools/codesniffer] - https://gerrit.wikimedia.org/r/377025 (owner: Umherirrender)
[07:47:33] (CR) Legoktm: [C: -1] "LGTM, suggestion about the codes inline" (2 comments) [tools/codesniffer] - https://gerrit.wikimedia.org/r/377028 (owner: Umherirrender)
[07:56:11] (PS1) Legoktm: Add configuration to generate PHPUnit coverage reports [tools/codesniffer] - https://gerrit.wikimedia.org/r/377042
[08:17:12] (CR) Legoktm: "Hm, the file comment should be above the namespace and use statements, so ideally the first case wouldn't happen.
And the sniff that check" [tools/codesniffer] - https://gerrit.wikimedia.org/r/375547 (https://phabricator.wikimedia.org/T167694) (owner: Legoktm)
[08:52:49] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[09:23:30] (CR) Paladox: [C: 1] Whitelist Dvorapa on Zuul CI [integration/config] - https://gerrit.wikimedia.org/r/375765 (owner: MarcoAurelio)
[09:30:54] (CR) Umherirrender: "I have seen it at https://phabricator.wikimedia.org/diffusion/ENUM/browse/master/NewUserMessage.class.php" [tools/codesniffer] - https://gerrit.wikimedia.org/r/375547 (https://phabricator.wikimedia.org/T167694) (owner: Legoktm)
[09:35:29] (PS2) Umherirrender: Fix @returns and @throw in function docs [tools/codesniffer] - https://gerrit.wikimedia.org/r/377028
[09:35:48] (CR) Umherirrender: "Yay, the names are not the best, changed with Patch Set 2" [tools/codesniffer] - https://gerrit.wikimedia.org/r/377028 (owner: Umherirrender)
[13:47:50] Gerrit, Patch-For-Review, User-Ladsgroup: Make gerrit use the new WMF logo - https://phabricator.wikimedia.org/T174576#3594584 (Ladsgroup) I resized @BFlores's logo and made the new patch.
[18:13:24] Gerrit, Release-Engineering-Team (Backlog), Wikimedia-Logstash, Patch-For-Review, Technical-Debt: Look into shoving gerrit logs into logstash - https://phabricator.wikimedia.org/T141324#3594855 (Paladox) I tested on my test instance using mediawiki-vagrant that had kibana and log stash. Using...
[18:40:21] Gerrit, Release-Engineering-Team (Backlog), Wikimedia-Logstash, Patch-For-Review, Technical-Debt: Look into shoving gerrit logs into logstash - https://phabricator.wikimedia.org/T141324#3594877 (Paladox) I'm wondering could we upload the filebeat apt repo from elasticsearch to apt.wikimedia.or...
[19:26:16] Deployment-Systems, Release-Engineering-Team (Backlog), wikitech.wikimedia.org, User-MarcoAurelio: Create an easier way to add/remove/modify patches for SWAT - https://phabricator.wikimedia.org/T171940#3594906 (MarcoAurelio) Has any discussion happened about this? My preference would be to use...
[19:42:16] Deployment-Systems, Release-Engineering-Team (Backlog), wikitech.wikimedia.org, User-MarcoAurelio: Create an easier way to add/remove/modify patches for SWAT - https://phabricator.wikimedia.org/T171940#3480850 (Zppix) why not use a system like google forms (but an open source/wikimedia maintained)
[20:01:19] (Draft1) MarcoAurelio: Archive Extension:Ads [integration/config] - https://gerrit.wikimedia.org/r/377071 (https://phabricator.wikimedia.org/T175495)
[20:01:23] (PS2) MarcoAurelio: Archive Extension:Ads [integration/config] - https://gerrit.wikimedia.org/r/377071 (https://phabricator.wikimedia.org/T175495)
[20:04:51] Deployment-Systems, Release-Engineering-Team (Backlog), wikitech.wikimedia.org, User-MarcoAurelio: Create an easier way to add/remove/modify patches for SWAT - https://phabricator.wikimedia.org/T171940#3595002 (Reedy)
[20:13:21] (CR) Zoranzoki21: [C: 1] Archive Extension:Ads [integration/config] - https://gerrit.wikimedia.org/r/377071 (https://phabricator.wikimedia.org/T175495) (owner: MarcoAurelio)
[20:42:46] Project selenium-Echo » firefox,beta,Linux,BrowserTests build #513: FAILURE in 1 min 46 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/513/
[20:42:47] Project selenium-Echo » chrome,beta,Linux,BrowserTests build #513: FAILURE in 1 min 47 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/513/
[21:17:40] Deployment-Systems, Release-Engineering-Team (Backlog), wikitech.wikimedia.org,
User-MarcoAurelio: Create an easier way to add/remove/modify patches for SWAT - https://phabricator.wikimedia.org/T171940#3595113 (mmodell) >>! In T171940#3594906, @MarcoAurelio wrote: > Has any discussion happened ab...
[21:19:33] Deployment-Systems, Release-Engineering-Team (Backlog), wikitech.wikimedia.org, User-MarcoAurelio: Create an easier way to add/remove/modify patches for SWAT - https://phabricator.wikimedia.org/T171940#3595129 (Zppix) >>! In T171940#3595113, @mmodell wrote: > I looked into the PageForms extension...
[21:34:37] (CR) Paladox: [C: 1] Archive Extension:Ads [integration/config] - https://gerrit.wikimedia.org/r/377071 (https://phabricator.wikimedia.org/T175495) (owner: MarcoAurelio)
[21:36:40] (CR) Zppix: [C: 1] Archive Extension:Ads [integration/config] - https://gerrit.wikimedia.org/r/377071 (https://phabricator.wikimedia.org/T175495) (owner: MarcoAurelio)
[22:57:21] MediaWiki-Codesniffer: Figure out how to properly document variadic arguments - https://phabricator.wikimedia.org/T175504#3595178 (Legoktm)