[00:54:34] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2671208 (10mmodell) @paladox done: {T146843} [01:07:15] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2672705 (10mmodell) I'm actually not sure what to make of this issue. The abovementioned projects show up in the global search autocomplete dropdown. That would seem to... [07:02:06] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [08:03:05] (03CR) 10Hashar: [C: 032] Specifying EventLogging as a CollaborationKit dependency. [integration/config] - 10https://gerrit.wikimedia.org/r/313122 (owner: 10Harej) [08:03:43] (03Merged) 10jenkins-bot: Specifying EventLogging as a CollaborationKit dependency. [integration/config] - 10https://gerrit.wikimedia.org/r/313122 (owner: 10Harej) [08:10:52] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2671208 (10hashar) I have tried all three examples mentioned and they know yield the appropriate result. I guess it is all related to the search index still being reinde... [08:14:29] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2673397 (10mmodell) 05Open>03Resolved a:03mmodell Ok I had to reindex projects using the --force argument, e.g: ```lang=s twentyafterfour@iridium:/srv/phab/phabri... [08:16:03] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2673401 (10mmodell) @hashar: The database reindex didn't actually fix it until I specifically reindexed projects with --force [08:18:07] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2673403 (10mmodell) >>! In T146673#2671505, @Paladox wrote: > Oh, but In innodb depending on what MySQL or mariadb version y... [08:21:24] twentyafterfour: congrats on fixing the Phabricator search index for projects :] [08:22:01] The global search index job seems to be still running though more slowly than before [08:22:07] PROBLEM - Puppet run on integration-slave-jessie-1005 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [08:22:27] twentyafterfour: I was wondering what was the spike of writes on db1048 half an hour or so ago [08:22:39] I guess some reindex is still running [08:24:06] and thanks to the SQL query on T146673 I have learned that the search entry can use + and - (eg boolean mode search https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html ) [08:24:07] neat [08:28:28] hashar: I haven't been able to get - to work for some reason [08:29:38] well [08:29:42] if I search for: mediawiki segfault [08:29:45] that yields a lot of tasks [08:29:53] but +mediawiki +segfault only returns a single task :] [08:30:10] and +segfault mediawiki yields three tasks [08:30:16] with the one referencing mediawiki being the firsdt [08:30:29] twentyafterfour: so looks like it works for me ? 
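A minimal sketch of the boolean-mode queries being exercised above, assuming Phabricator's fulltext data lives in phabricator_search.search_documentfield with a corpus column (names may differ):
```
# '+' requires a term, '-' excludes it; MATCH/AGAINST needs a FULLTEXT index on the column
mysql -e "SELECT phid FROM phabricator_search.search_documentfield
          WHERE MATCH(corpus) AGAINST('+segfault -hhvm' IN BOOLEAN MODE);"
```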
:D [08:31:15] !log Reloading Zuul to deploy dc2ada37 [08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [08:41:00] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2673453 (10Tgr) Search quality for multiple terms is still extremely poor. E.g. for the query [[https://phabricator.wikimedia.org/search/query/q4v1FMKd4p4Y/#R|phabricato... [08:47:08] RECOVERY - Puppet run on integration-slave-jessie-1005 is OK: OK: Less than 1.00% above the threshold [0.0] [08:49:09] hashar: + works but does - work? [08:49:45] +segfault : two tasks [08:49:52] errrr [08:49:56] cancel that :D [08:50:05] +segfault --> 3 tasks [08:50:38] +segfault -hhvm : 2 tasks which excludes T78558 about a hhvm segfault [08:51:29] segfault -hhvm : two tasks as well [08:51:41] twentyafterfour: looks like - works :] [08:52:05] cool [08:58:32] for stemming / ElasticSearch I would not obsess about it [08:58:36] not sure if it is worth the trouble :] [09:00:06] hashar: zeljkof: just out of curiosity, is there a way for e.g. test.wikidata.org that the browsertests job runs the version of browsertests from the branch that is actually deployed on test.wikidata.org? [09:00:16] currently it always runs the tests from latest master [09:00:33] Tobi_WMDE_SW: no, it should run from the same branch that is at the target site [09:00:51] meaning, it should already do that [09:00:57] it should not run from master [09:01:05] * zeljkof is looking up code [09:01:08] zeljkof: hm.. [09:01:12] yeah we have a hack for that [09:01:36] a python script that queries MEDIAWIKI_URL/w/api.php to get the branch / version [09:01:38] hashar: zeljkof: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/122/consoleFull [09:01:42] so the job does start with master [09:01:48] but is then switched to the appropriate branch [09:01:51] https://integration.wikimedia.org/ci/view/Selenium/job/selenium-Wikibase/122/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/consoleFull [09:01:53] 04:40:25 Checking out Revision c9ff7e1907125572519f8b3ddbc9b2b82fa13dfc (origin/master) [09:01:54] (which mess up with the git plugin) [09:01:58] 00:00:15.898 + git checkout -f origin/wmf/1.28.0-wmf.20 [09:02:08] Tobi_WMDE_SW: look lower in the outpu [09:02:14] output [09:02:23] basically we short circuit the Jenkins git plugin which always use master [09:02:39] it first checks out master, then later pings the target site for it's branch, and checks it out [09:02:46] ah, 04:40:27 origin/wmf/1.28.0-wmf.20 branch does not exist. 04:40:27 + echo 'Fallbacking to master branch...' [09:03:00] I guess that repo lack the wmf branch? [09:03:47] Tobi_WMDE_SW: https://phabricator.wikimedia.org/diffusion/CICF/browse/master/jjb/job-templates-selenium.yaml;dc2ada375706f31c11af8c254a20672b04329ef2$50-72 [09:04:01] oh, that could be the case [09:04:08] I did not check [09:04:18] yeah [09:04:25] wmf.20 is not in Wikibase.git :( [09:04:44] hashar: is it something we messed up, or wikibase people? 
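The branch-switch hack being described boils down to roughly the following (a sketch: jq is assumed for JSON parsing, the real job uses a Python helper in jjb/job-templates-selenium.yaml):
```
# ask the target wiki which branch it runs, then try to check it out
branch=$(curl -s 'https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
  | jq -r '.query.general["git-branch"]')      # e.g. "wmf/1.28.0-wmf.20"
if git rev-parse -q --verify "origin/$branch" >/dev/null; then
  git checkout -f "origin/$branch"
else
  echo 'Fallbacking to master branch...'       # wording matches the job output above
  git checkout -f origin/master
fi
```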
:) [09:05:37] that is because Wikidata is deployed asynchronously [09:05:48] hashar: zeljkof: ya, we don not deploy regularly [09:05:50] eg we cut wmf.20 for all repos BUT wikibase/wikidata [09:05:54] deploy wmf.20 [09:06:10] but Wikibase/Wikidata is explicitly kept at some previous version [09:06:15] which is then updated out of the mw train [09:07:25] hashar: so, maybe we need to add a fallback to the latest branch in case the actual branch is not found.. [09:07:52] the script queries https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general [09:07:58] or query the extension for the info which branch is deployed.. if that's possible [09:08:05] which has query: general: git-branch: "wmf/1.28.0-wmf.20" [09:08:19] yeah [09:08:25] so if the API exposes the branch of each extension [09:08:28] we could use that instead [09:09:25] then https://test.wikipedia.org/wiki/Special:Version [09:09:28] only lisdt the sha1 [09:09:32] not sure whether the branch is collected [09:11:58] if we only have the sha1, we can lookup the branch with something like git branch --remote --contains 123456 [09:12:03] and some appropriate sorting [09:12:13] but then we have cherry pick on the cluster, so the sha1 might not be in the Gerrit repo :( [09:12:29] hashar: that sounds evil.. :( [09:12:41] so in short, we need the mw api to expose the branch of extensions [09:12:44] maybe it does already [09:12:53] and find out the API query that yields the list of extensions [09:13:00] hashar: ewwww [09:13:02] then the python script can be hacked it [09:13:17] the git_branch parameter, it is exposed solely for browser tests [09:13:25] I have actually added support for that in the API years ago [09:15:05] https://gerrit.wikimedia.org/r/#/c/131861/ / https://phabricator.wikimedia.org/T64509 :D [09:18:07] Tobi_WMDE_SW: I have checked the Git info cache on prod. The "branch" has the sha1 of the commit not then ame :( [09:18:10] name [09:21:48] yeah [09:21:57] GitInfo is probably broken :D [09:22:22] it tries to read .git/HEAD for a list of ref [09:22:32] but when it is a submodule the .git/HEAD file has: gitdir: ../../.git/modules/extensions/Wikidata [09:23:16] or maybe that is taken in account elsewhere [09:23:51] ah yes it is then the file only has 52c197736dbbdc36cf0b1e45b8dd09701ae96117 [09:25:24] Tobi_WMDE_SW: addshore: so it is probably doable by hacking in MediaWiki GitInfo to get it to properly find the branch of the extensions (which are submodules) [09:25:40] then add some an entry point in the API to expose them [09:30:36] hashar: ok, thx for looking into this [09:31:12] anothersolution would be to fall back to the newses existing wmf-branch [09:31:16] *newest [09:31:27] if that's easier to do [09:47:00] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10MediaWiki-Unit-tests, 10MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2673630 (10daniel) Thank you for looking... 
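The sha1-to-branch lookup hashar sketches would look something like this (the sha1 is the one quoted above; as noted, it fails for cherry-picks that only exist on the cluster, not in the Gerrit repo):
```
# find the newest wmf/* branch containing a deployed commit
sha1=52c197736dbbdc36cf0b1e45b8dd09701ae96117
git branch --remote --contains "$sha1" \
  | grep 'origin/wmf/' | sort --version-sort | tail -n1
```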
[09:52:21] Tobi_WMDE_SW: yeah that is surely doable [09:53:54] actually that is way easier probably :D the logic being in jjb/job-templates-selenium.yaml [09:54:10] would have to look at the wmf branches in the local repo [09:54:20] and pick the last wmf one assuming it is the one deployed [09:58:25] it is cooking time [10:23:03] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2673719 (10Paladox) Maybe we will want to also do --force for manifest tasks to per ^^ [10:23:25] hashar hi, doint forget the gc patches from yesturday :) [10:25:53] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2673721 (10Paladox) Thanks. [10:26:13] 10Gerrit: Gerrit account issue (can't log into Gerrit, can log into wikitech) - https://phabricator.wikimedia.org/T146887#2673722 (10Eloquence) [10:33:12] paladox: let me push that one indeed [10:33:25] Oh thanks :) [10:33:58] (03CR) 10Hashar: [C: 032] Disable garbage collection for mw-phpunit.sh too [integration/jenkins] - 10https://gerrit.wikimedia.org/r/313051 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:34:07] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10MediaWiki-Unit-tests, 10MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2673744 (10Paladox) Your welcome :) [10:36:29] hashar would you also be able to merge https://gerrit.wikimedia.org/r/313056 please? [10:36:31] For wikibase [10:36:45] then we will be able to un block the test that kept seg faulting [10:36:54] yup [10:36:56] but will do that one later [10:37:01] I am going to get out for lunch soonish [10:37:10] Ok [10:37:12] thanks [10:40:17] (03Merged) 10jenkins-bot: Disable garbage collection for mw-phpunit.sh too [integration/jenkins] - 10https://gerrit.wikimedia.org/r/313051 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:43:23] !log Updating slave scripts for "Disable garbage collection for mw-phpunit.sh" https://gerrit.wikimedia.org/r/313051 T142158 [10:43:26] paladox: pushed [10:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:43:32] thankyou :) [10:43:38] (03PS3) 10Hashar: Disable garbage collection for wikibase tests [integration/config] - 10https://gerrit.wikimedia.org/r/313056 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:44:54] !log CI updating all mwext-Wikibase* jenkins jobs for https://gerrit.wikimedia.org/r/#/c/313056/ T142158 [10:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [10:45:21] paladox: and I have updated the Jenkins job [10:45:28] Oh thankyou :) [10:45:53] (03CR) 10Paladox: [C: 031] "Needs +2 :)" [integration/config] - 10https://gerrit.wikimedia.org/r/313056 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:46:32] (03CR) 10Hashar: [C: 032] "I have updated all 9 jobs" [integration/config] - 10https://gerrit.wikimedia.org/r/313056 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:47:29] (03Merged) 10jenkins-bot: Disable garbage collection for wikibase tests [integration/config] - 10https://gerrit.wikimedia.org/r/313056 (https://phabricator.wikimedia.org/T142158) (owner: 10Paladox) [10:52:04] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 
10MediaWiki-Unit-tests, 10MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2673757 (10Paladox) @daniel and @hoo the... [10:53:11] paladox: hoo had a hack to skip some faulty test [10:53:21] might want to craft a change that revert/Remove the hack and see whether it still segfault [10:53:26] I am out for lunch [10:53:27] yep [10:53:32] but I am nearby [10:54:08] oh lol [10:55:11] :) [11:46:24] !sal [11:46:24] https://tools.wmflabs.org/sal/releng [11:48:28] !log Deleting deployment-tin Trusty instance and recreate one with same hostname as Jessie; Meant to replace deployment-tin02 T144006 [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [12:03:49] RECOVERY - Host deployment-tin is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [12:08:03] RECOVERY - SSH on deployment-tin is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:38:21] (03PS2) 10Hashar: Whitelist Neslihan [integration/config] - 10https://gerrit.wikimedia.org/r/312868 (owner: 10Mholloway) [12:38:25] (03CR) 10Hashar: [C: 032] Whitelist Neslihan [integration/config] - 10https://gerrit.wikimedia.org/r/312868 (owner: 10Mholloway) [12:39:19] 10Browser-Tests-Infrastructure, 10MobileFrontend, 06Reading-Web-Backlog, 07Browser-Tests: Cucumber tests won't run locally on firefox 47 - https://phabricator.wikimedia.org/T138095#2673984 (10zeljkofilipin) [12:39:22] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 07Upstream, 15User-zeljkofilipin: Firefox v47 breaks mediawiki_selenium - https://phabricator.wikimedia.org/T137561#2673982 (10zeljkofilipin) 05Open>03Resolved Firefox 47.0.1 works fine with our current infrastructure. See {T137540} for r... [12:39:22] (03Merged) 10jenkins-bot: Whitelist Neslihan [integration/config] - 10https://gerrit.wikimedia.org/r/312868 (owner: 10Mholloway) [12:40:02] (03CR) 10Hashar: "Deployed :]" [integration/config] - 10https://gerrit.wikimedia.org/r/312868 (owner: 10Mholloway) [12:44:07] !log Cant finish up the switch to deployment-tin, puppet still does not pass due to weird clone issues ... [12:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [12:45:21] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 07Upstream, 15User-zeljkofilipin: Firefox v47 breaks mediawiki_selenium - https://phabricator.wikimedia.org/T137561#2674005 (10hashar) We have the slaves pinned at 46 for now. Should we switch to 47.0.1 ? [12:47:18] meeting [12:53:58] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Config, 07Upstream, 15User-zeljkofilipin: Firefox v47 breaks mediawiki_selenium - https://phabricator.wikimedia.org/T137561#2674024 (10zeljkofilipin) Both 46 and 47.0.1 are fine (but please notice that 47.0.0 will break everything). If it is not a lo... [13:46:58] Yippee, build fixed! 
[13:46:59] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #161: 09FIXED in 2 min 58 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/161/ [14:13:00] 10Browser-Tests-Infrastructure, 06Reading-Web-Backlog, 07Jenkins, 13Patch-For-Review, and 2 others: MEDIAWIKI_URL may be set to incorrect value in mwext-mw-selenium job - https://phabricator.wikimedia.org/T144912#2674157 (10zeljkofilipin) a:05zeljkofilipin>03None [14:14:14] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 10MediaWiki-extensions-Examples, 07Documentation, and 5 others: Improve documentation around running/writing (with lots of examples) browser tests - https://phabricator.wikimedia.org/T108108#2674161 (10zeljkofilipin) a:03zeljkofilipin [14:21:43] Is there some magic comment to run mediawiki-phpunit-php55-trusty without actually +2ing a change? [14:26:57] anomie i think it is check php5 [14:27:11] Thanks, I'll try that [14:27:11] check php5 [14:27:16] Your welcome [14:30:17] zeljkof hashar hi there seems to be a job that has been running for two hours at https://integration.wikimedia.org/zuul/ [14:30:23] mwext-mw-selenium [14:30:29] https://integration.wikimedia.org/ci/job/mwext-mw-selenium/10445/ [14:30:36] paladox: uh oh [14:30:44] Yep [14:30:45] should not be running for that long [14:30:48] Yep [14:30:51] Seems to have froze [14:31:04] or could it have been the abort script dosent work properly? [14:31:09] just a second to finish something, will check, thanks for the heads up [14:31:30] Your welcome [14:32:05] zeljkof im now wondering could this be a bug with jenkins? Since this has happened before. I am wondering if it is fixed in jenkins 2 or it requires a bug filling [14:34:32] paladox: hm, not sure what the timeout for the job is [14:34:41] Oh [14:34:52] 30mins [14:35:00] zeljkof ^^ [14:35:05] 00:30:00.027 Build timed out (after 30 minutes). Marking the build as failed. [14:35:09] Yep [14:35:17] but it still did not get aborted somehow :| [14:35:20] Seems like it dosen't really fail [14:35:30] hasharAway would probably know, but he's away [14:35:45] Maybe we will want to create a script for after the job aborts to really make sure it gets aborted [14:36:50] I'm trying to abort the job via jenkins web interface, but it does not work o.O [14:37:38] zeljkof i think it may be because of https://wiki.jenkins-ci.org/display/JENKINS/Build-timeout+Plugin (yellow warnning box) [14:38:53] this is the patch the job is related to https://gerrit.wikimedia.org/r/#/c/313203/1 [14:39:02] Oh [14:39:26] zeljkof should i do a recheck? [14:39:31] hello, our machine on deployment-prep doesn't look like it gets deployed to: deployment-eventlogging03 has code from some months ago, how can we fix that? [14:39:47] paladox: please don't! [14:39:50] Ok [14:39:59] I think it will just start another job [14:40:03] Oh [14:40:30] ping hasharAway or greg-g [14:43:40] what code is not being updated nuria_? [14:43:58] paladox: ok, I have stopped the job, will create a ticket in phab and ping hasharAway to take a look later [14:44:07] Ok thanks [14:45:38] nuria_, /srv/deployment/eventlogging/eventlogging hasn't been updated recently? 
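For context on the deploy step suggested just below: assuming eventlogging on beta is a scap3-managed repo (an assumption, as is the exact scap invocation), a manual deploy from the deployment master is roughly:
```
# rough sketch of a manual scap3 deploy from the beta deploy master
ssh deployment-mira.deployment-prep.eqiad.wmflabs
cd /srv/deployment/eventlogging/eventlogging
git pull                                   # or fetch/checkout the wanted revision
scap deploy 'Update eventlogging on beta'  # scap3 deploy with a log message
```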
[14:45:50] Krenair: I *think* the code deployed to /srv/deployment/eventlogging in eployment-eventlogging03 is not updated automatically [14:45:56] you could log into deployment-mira and deploy a new version [14:46:08] Krenair: is that documented somewhere? [14:46:10] I'm not sure if anything beyond mediawiki and scap are deployed automatically [14:46:21] well, is production updated automatically? [14:48:47] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 07Jenkins, 15User-zeljkofilipin: mwext-mw-selenium jenkins job not aborting after 30 minutes - https://phabricator.wikimedia.org/T146903#2674276 (10zeljkofilipin) [14:49:02] paladox: ^ [14:49:09] Thankyou [14:49:09] :) [14:49:46] paladox: thank _you_ ;) [14:49:55] Your welcome :) [14:50:03] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 07Jenkins, 15User-zeljkofilipin: mwext-mw-selenium jenkins job not aborting after 30 minutes - https://phabricator.wikimedia.org/T146903#2674292 (10Paladox) May be related to https://wiki.jenkins-ci.org/display/JENKINS/Build-timeout+P... [14:58:42] zeljkof it seems this https://integration.wikimedia.org/ci/job/mediawiki-core-php55lint/14372/ has also got stuck [14:59:17] 10Browser-Tests-Infrastructure, 10Continuous-Integration-Infrastructure, 07Jenkins, 15User-zeljkofilipin: mwext-mw-selenium jenkins job not aborting after 30 minutes - https://phabricator.wikimedia.org/T146903#2674299 (10Paladox) Another job that got stuck https://integration.wikimedia.org/ci/job/mediawik... [14:59:30] paladox: argh :( [14:59:39] please add link to the task [14:59:39] Yep [14:59:43] have to go to a meeting [14:59:44] Already done :) [14:59:46] ok [14:59:51] thanks [15:00:01] your welcome :) [15:01:16] (03CR) 10Paladox: "recheck" [integration/config] - 10https://gerrit.wikimedia.org/r/227223 (https://phabricator.wikimedia.org/T105474) (owner: 10Hashar) [15:02:24] 10Continuous-Integration-Config, 10RESTBase, 06Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RESTBase is updated on beta labs - https://phabricator.wikimedia.org/T146488#2674302 (10Mholloway) Adding @bearND and @mobrovac for thoughts. >>! In T146488#2669083, @greg wrote: > 1) work... [15:03:33] (03CR) 10Paladox: "@hashar would you be able to re review this please?" [integration/config] - 10https://gerrit.wikimedia.org/r/227223 (https://phabricator.wikimedia.org/T105474) (owner: 10Hashar) [15:09:15] 10Continuous-Integration-Config, 10RESTBase, 06Wikipedia-Android-App-Backlog: Kick off periodic Android CI tests when RESTBase is updated on beta labs - https://phabricator.wikimedia.org/T146488#2674309 (10Mholloway) >>! In T146488#2674302, @Mholloway wrote: > That said, it looks like the test is no longer e... [15:11:27] 10Beta-Cluster-Infrastructure: Cannot create account on deployment.wikimedia.beta.wmflabs.org: No Captcha due to 503 error - https://phabricator.wikimedia.org/T146904#2674312 (10Aklapper) [15:22:59] 10Beta-Cluster-Infrastructure: Cannot create account on deployment.wikimedia.beta.wmflabs.org: No Captcha due to 503 error - https://phabricator.wikimedia.org/T146904#2674312 (10AlexMonk-WMF) Looking at varnishlog, it successfully gets to deployment-mediawiki04, but then this happens: ``` 13 FetchError c Inv... [15:27:59] < Content-Encoding: gzip [15:28:05] �PNG [15:28:07] really... 
[15:30:38] 10Beta-Cluster-Infrastructure: Cannot create account on deployment.wikimedia.beta.wmflabs.org: No Captcha due to 503 error - https://phabricator.wikimedia.org/T146904#2674372 (10AlexMonk-WMF) It seems that Apache claims it's sending gzipped content (with `Content-Encoding: gzip`), but it then sends just a plain... [15:36:51] 10Continuous-Integration-Infrastructure: PHP7 support in CI (tracking) - https://phabricator.wikimedia.org/T144964#2674374 (10Paladox) We could potentially use the Jessie instances @legoktm setup for some of the basic tests that doint test heavly, like phplint-7 tests but those could possibly go on nodepool, but... [15:38:41] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #173: 04FAILURE in 16 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/173/ [15:39:28] (03PS1) 10Paladox: Add php7 pipeline for zuul [integration/config] - 10https://gerrit.wikimedia.org/r/313213 (https://phabricator.wikimedia.org/T144872) [15:39:36] greg-g: Hi, is it possible to deploy https://gerrit.wikimedia.org/r/#/c/313207/ as emergency deploy? The event will be hold this Friday and there is no regular window. Thanks in advance for approval. ) [15:40:02] Urbanecm, see the other channel [15:40:28] :) [15:41:03] Krenair and greg-g: Thanks you both. [15:45:56] Project selenium-MobileFrontend » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #173: 04FAILURE in 23 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/173/ [15:51:40] legoktm hi, im thinking we can make all the tests run in php7, using the experimental pipeline and could possibly use the above ^^ for a php7 pipeline [15:51:56] Im wondering on the jessie instances do they have php7 [15:52:03] or just he nodepool jessie instances? [15:53:18] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674445 (10jcrespo) Here is the things I would like to be done to close this ticket (all ongoing issues solved): * Check th... [15:53:46] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674446 (10jcrespo) [15:54:02] 10Beta-Cluster-Infrastructure: Cannot create account on deployment.wikimedia.beta.wmflabs.org: No Captcha due to 503 error - https://phabricator.wikimedia.org/T146904#2674447 (10AlexMonk-WMF) https://gerrit.wikimedia.org/r/#/c/311357/7 touched this area of code recently [16:01:28] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674457 (10jcrespo) I would add one extra bullet point to the TODO: evaluate other potentially problematic Aria/MyISAM table... [16:02:37] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674476 (10Paladox) >>! In T146673#2674457, @jcrespo wrote: > I would add one extra bullet point to the TODO: evaluate other... 
[16:03:26] 10Beta-Cluster-Infrastructure: Cannot create account on deployment.wikimedia.beta.wmflabs.org: No Captcha due to 503 error - https://phabricator.wikimedia.org/T146904#2674506 (10AlexMonk-WMF) The obResetFunc property of the result of `FileBackendGroup::singleton()->get( 'global-multiwrite' )` points to the backe... [16:14:27] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T145220#2674573 (10AlexMonk-WMF) [16:19:13] 10Beta-Cluster-Infrastructure, 06Operations, 07Puppet: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#2674583 (10hashar) [16:21:34] 10Beta-Cluster-Infrastructure, 06Operations, 07Puppet: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#2674598 (10hashar) [16:32:57] PROBLEM - Keyholder status on deployment-tin02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [17:00:04] (03PS1) 10Paladox: Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) [17:01:05] (03PS2) 10Paladox: Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) [17:02:13] (03CR) 10jenkins-bot: [V: 04-1] Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) (owner: 10Paladox) [17:45:32] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674853 (10mmodell) @jcrespo: Are there other myisam tables? I was under the impression that this was the only one but I wil... [17:51:44] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674872 (10Paladox) @mmodell yep, there are other myisam tables. [17:55:53] (03PS3) 10Paladox: Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) [17:57:13] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674879 (10mmodell) ```lang=sql SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE Engine = 'MyISAM'; `... [17:57:37] (03PS4) 10Paladox: Update the mediawiki core tests to also test against php7 [integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) [17:58:39] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2674880 (10mmodell) @jcrespo: Those two tables ^ appear to be all that remain on myisam. I can't see any reason why we shoul... [18:00:42] (03CR) 10Paladox: [C: 031] "This is good to go since this future proofs ci." 
[integration/config] - 10https://gerrit.wikimedia.org/r/313223 (https://phabricator.wikimedia.org/T144964) (owner: 10Paladox) [18:11:43] (03PS1) 10Paladox: Reuse phplint code in job-template.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/313230 [18:12:46] (03CR) 10jenkins-bot: [V: 04-1] Reuse phplint code in job-template.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/313230 (owner: 10Paladox) [18:16:28] (03PS2) 10Paladox: Reuse phplint code in job-template.yaml [integration/config] - 10https://gerrit.wikimedia.org/r/313230 [18:29:32] 06Release-Engineering-Team, 10Phabricator: Phab Advanced Search no longer showing typical results - https://phabricator.wikimedia.org/T146789#2675015 (10mmodell) @tgr: Hopefully {T146843} will address the quality of results. I don't think innodb is doing a very good job. @paladox: That would take several days... [18:51:41] 10Beta-Cluster-Infrastructure, 06Operations, 07Puppet: grain-ensure erroneous mismatch with (bool)True vs (str)true - https://phabricator.wikimedia.org/T146914#2675080 (10hashar) Looks like the main reason we have `grain-ensure.py` is to execute salt commands without a master (file_config = local). Nowadays... [18:53:03] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2675083 (10mmodell) @jcrespo: So I think what remains for this task is as follows: 1. Reduce `innodb_ft_min_token_size` fr... [19:02:07] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2675099 (10Paladox) "1. Reduce innodb_ft_min_token_size from 3 to 2" Done here https://gerrit.wikimedia.org/r/#/c/313235/ [19:08:10] 06Release-Engineering-Team, 10Phabricator, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2675121 (10Paladox) "2. Import stopwords from https://github.com/phacility/phabricator/blob/master/resources/sql/stopwords.t... [19:43:41] so [19:43:44] going to break beta [19:43:53] drop deployment-tin02 [19:43:57] replace it with a new deployment-tin [19:44:06] and change scap master from deployment-mira to deployment-tin [19:44:43] sounds like a fun evening [19:45:56] 06Release-Engineering-Team, 06Developer-Relations, 10Phabricator, 10Wikimedia-Blog-Content: blog.wikimedia.org post on Phabricator improvements - https://phabricator.wikimedia.org/T141457#2675211 (10EdErhart-WMF) Great! Mel's comments above then are perfect. I think you'll want to use a framework that demo... 
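A hedged sketch of the remaining T146673 steps as they would run against MariaDB; the table names are placeholders, and innodb_ft_min_token_size itself is a startup option that can only be changed in my.cnf followed by a restart:
```
# converting a leftover MyISAM table is a plain engine switch:
mysql -e "ALTER TABLE phabricator_search.search_documentfield ENGINE=InnoDB;"
# after lowering innodb_ft_min_token_size in my.cnf and restarting mysqld,
# each InnoDB FULLTEXT index must be rebuilt for the new size to take effect:
mysql -e "SET GLOBAL innodb_optimize_fulltext_only = ON;
          OPTIMIZE TABLE phabricator_search.search_documentfield;"
# the InnoDB stopword list is read from a user table named by a global; the
# named table must be InnoDB with a single VARCHAR column called 'value':
mysql -e "SET GLOBAL innodb_ft_server_stopword_table = 'phabricator_search/stopwords';"
```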
[19:49:30] !log Dropping deployment-tin02 , replacing it with deployment-tin which has been rebuild to Jessie T144006 [19:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:02:11] Project beta-scap-eqiad build #122025: 04FAILURE in 7 min 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122025/ [20:03:08] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:04:02] Project beta-update-databases-eqiad build #11679: 15ABORTED in 44 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11679/ [20:04:08] Project beta-code-update-eqiad build #123460: 15ABORTED in 1 min 8 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/123460/ [20:04:30] OH MY god [20:04:32] sudo please [20:04:46] sudo -u mwdeploy -n -- rsync [20:04:59] Failed to add the ECDSA host key for IP address '10.68.19.42' to the list of known hosts (/home/hashar/.ssh/known_hosts). [20:05:00] bah [20:05:50] lol [20:07:43] PROBLEM - Puppet run on deployment-cache-upload04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:08:25] !log deployment-tin for instance in `grep deployment /etc/dsh/group/mediawiki-installation`; do ssh-keyscan `dig +short $instance` >> /etc/ssh/ssh_known_hosts; done; [20:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:08:31] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:09:46] !log deployment-tin: keyholder arm [20:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:11:15] !log Switch Jenkins slave deployment-mira.eqiad to deployment-tin.eqiad [20:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:13:07] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:19] Project beta-code-update-eqiad build #123461: 15ABORTED in 16 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/123461/ [20:16:17] hasharAway hi, im getting this warnning [20:16:18] WARNING:root:logrotate is deprecated on jenkins>=1.637, use the property build-discarder on newer jenkins instead [20:16:25] when updating zuul project [20:16:31] By running [20:16:38] Nope [20:16:39] when [20:16:43] running jenkins job builder [20:16:52] jenkins-jobs --conf etc/jenkins_jobs.ini update config/ [20:17:15] We will want to update it with http://docs.openstack.org/infra/jenkins-job-builder/properties.html#properties.build-discarder [20:19:01] !log restarted keyholder on deployment-tin [20:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:19:08] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:20:01] Project beta-update-databases-eqiad build #11680: 04FAILURE in 1.1 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11680/ [20:20:42] Project beta-scap-eqiad build #122026: 04STILL FAILING in 33 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122026/ [20:21:06] oh for god sake again [20:21:22] thcipriani: dont you have a patch to add proper PHP packages on the deployment servers? [20:22:43] ugh. No, I filed a task for it [20:22:45] * thcipriani digs [20:24:08] blerg. It was the mwscript task. 
Hard-coded php5 in mwscript causing...whatever [20:24:43] ahhh [20:25:04] lets hmm [20:25:09] Project beta-scap-eqiad build #122027: 04STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122027/ [20:25:19] pick a class between mediawiki::packages::legacy and mediawiki::packages:php5 :D [20:25:25] indeed [20:26:03] legacy is for precise [20:26:30] https://phabricator.wikimedia.org/T146286 [20:26:55] and we have a role::deployment::mediawiki class doh [20:26:56] so php5-{memcached,redis,mysql,curl} should do it [20:27:14] the deployment server puppet is a complete mess. [20:27:40] * hasharAway adds more [20:27:45] :D [20:28:32] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [20:29:08] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [20:29:52] https://gerrit.wikimedia.org/r/313305 Bring back Zend PHP on deployment server [20:29:59] (03PS1) 10Paladox: Replace derecated logrotate with build-discarder [integration/config] - 10https://gerrit.wikimedia.org/r/313306 [20:30:09] hasharAway ^^ done :) [20:30:38] neat :] [20:31:18] Yep :) [20:31:41] hasharAway but i also have these other warnnings [20:31:42] WARNING:jenkins_jobs.registry:You have a macro ('tox') defined for 'builder' component type that is masking an inbuilt definition [20:31:42] WARNING:jenkins_jobs.registry:You have a macro ('doxygen') defined for 'builder' component type that is masking an inbuilt definition [20:32:46] Maybe we should rename https://github.com/wikimedia/integration-config/blob/69ff4d3eb9d57710525374533213d39cf2434d54/jjb/macro.yaml#L545 [20:32:50] to toxlint? [20:33:05] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2675450 (10hashar) For some reason the Zend packages are no more installed on Jessie, though we actually need them at least for mwscript... [20:33:34] paladox: run-tox :] [20:33:37] and run-doxygen [20:33:45] Oh thanks [20:33:46] :) [20:33:47] I would use that [20:33:52] as our custom macros [20:33:57] Oh [20:34:02] So i shoulden rename it [20:34:07] just create a new builder? [20:34:14] which are shorter than wikimedia-override-of-jjb-builtin_doxygen [20:34:15] :D [20:34:22] rename! [20:34:25] Ok [20:34:26] thanks [20:34:27] :) [20:34:27] and gotta rename all past occurences [20:34:28] in the end [20:34:32] Oh [20:34:37] Im using github to help [20:34:38] the job that does the configuration diff would show an empty diff [20:34:50] thcipriani: fixed :] [20:34:52] then using my local search to find the rest of the references [20:35:02] thcipriani: so I guess someone removed Zend from jessie [20:35:18] thcipriani: and whenever we reimage tin.eqiad.wmnet, we will miss the zend extensions again :D [20:35:39] this thing is why/how it was removed: https://github.com/wikimedia/operations-puppet/blob/production/modules/mediawiki/manifests/packages.pp#L8 [20:35:56] ah [20:36:08] left me to wonder how mira in prod will behave [20:36:22] ah: https://github.com/wikimedia/operations-puppet/commit/35f6983b1a60b9f7cf1fa120dcf2f1df8fc8374c [20:36:59] thcipriani: can you paste that to the task please ? 
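The package list thcipriani suggests, as a single install line (stock Debian jessie package names; the actual puppet class may pull WMF builds instead, and php5-cli is an assumption to supply the php5 binary that mwscript shells out to):
```
sudo apt-get install php5-cli php5-curl php5-memcached php5-mysql php5-redis
```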
[20:37:10] and mira.codfw.wmnet is impacted as well :( [20:37:20] so we cant use it as a fallback in case tin explode [20:37:57] (03PS1) 10Paladox: Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 [20:38:06] hasharAway ^^ done :) [20:38:13] now gotta see if it fails jenkins [20:38:29] https://integration.wikimedia.org/ci/job/integration-jjb-config-diff/6091/console [20:38:30] failed [20:38:55] (03CR) 10jenkins-bot: [V: 04-1] Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox) [20:39:01] Oh [20:39:13] thcipriani: oh and on preliminary setup of a new dpeloyment server, the keyholder is broken. The key just dont work [20:39:17] restarting the service fix it [20:39:30] 06Release-Engineering-Team, 06Operations, 07Beta-Cluster-reproducible, 13Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2675477 (10thcipriani) It looks like we removed php5 packages as part of the move to jessie: https://github.com/wikimedia/operations-pupp... [20:39:43] on subsequent reboot it works just all fine though. So the provisionning is broken somehow [20:40:15] when you say "the key just dont work" you mean that they're unusable? [20:40:24] I really wish we can complete the phaseout of Zend in prod [20:40:25] or that they're not listed in keyholder? [20:40:37] keyholder status shows bunch of keys loaded [20:40:49] but ssh to hosts says the key is rejected [20:41:02] anything about "agent refused to sign"? [20:41:07] in the error output [20:41:09] ? [20:41:14] that or private key refused [20:41:16] cant remember :( [20:41:20] something is off by one. But really I dont think it is much of an issue [20:41:26] it is not like we reimage them so often [20:41:47] Yippee, build fixed! [20:41:48] Project beta-scap-eqiad build #122028: 09FIXED in 9 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122028/ [20:41:50] I am willing to forget about the issue :D [20:41:56] fixed? [20:41:58] my theory is that keyholder-proxy is being started before the keys are actually placed on disk, which would cause what you're seeing. [20:42:06] ahh [20:42:36] so we might be missing a dependency order in puppet or in the systemd unit file [20:43:18] yeah, something about notify keyholder-proxy whenever a new key is added to the agent or a new keyholder-auth.d is added to disk. [20:43:37] well [20:43:41] (03PS2) 10Paladox: Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 [20:43:45] keyholder agent doesn't require any re-arming, just reloads the service, so can be done a bunch without any harm. 
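The recovery sequence being described, spelled out as a sketch (arm/status per WMF's keyholder wrapper; restarting the proxy works around keys being loaded before they existed on disk):
```
sudo keyholder status                  # keys listed, but signing may still fail
sudo service keyholder-proxy restart   # pick up keyholder-auth.d / keys written later
sudo keyholder arm                     # re-add keys; prompts for passphrases
```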
[20:44:06] so yeah it is probably not a big issue [20:44:43] (03CR) 10jenkins-bot: [V: 04-1] Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox) [20:45:31] (03PS3) 10Paladox: Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 [20:46:32] (03CR) 10jenkins-bot: [V: 04-1] Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox) [20:47:29] 10Beta-Cluster-Infrastructure, 06Operations, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2675507 (10hashar) I have dropped deployment-tin02 it was confusing people and create a deployment-tin which is now the master https://gerrit.wikimedia.... [20:47:51] (03PS4) 10Paladox: Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 [20:48:34] !log Deleted deployment-tin02 via Horizon. Replaced by deployment-tin [20:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [20:50:58] Project beta-update-databases-eqiad build #11681: 15ABORTED in 7 min 58 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11681/ [20:51:28] PROBLEM - Host deployment-tin02 is DOWN: CRITICAL - Host Unreachable (10.68.18.163) [20:51:40] so scap works [20:51:47] mw config changes are applied to tin [20:51:59] the database update job looks stall somehow :( [20:52:12] and that installer output looks really dated [20:52:49] hrm. the database update job I had to run manually at one point for deployment-mira to get it to work the first time. [20:53:14] and by manually, I think I just did a bash loop over the dblist. [20:53:32] the python kept failing weirdly and unexplainably. [20:55:18] (03CR) 10jenkins-bot: [V: 04-1] Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 (owner: 10Paladox) [20:56:35] (03PS5) 10Paladox: Rename toxgen and doxygen to run-tox and run-doxygen in builder: [integration/config] - 10https://gerrit.wikimedia.org/r/313308 [20:57:31] hasharAway it seems nodepool has got slower https://integration.wikimedia.org/zuul/ [20:57:43] with only 3 tests running [20:58:58] paladox: on that page click the nodepool button at the bottom [20:59:03] brings you to https://grafana.wikimedia.org/dashboard/db/nodepool [20:59:14] the first graph (top left) shows the pool state [20:59:18] Oh yes forgot about that [20:59:23] top right in the menu change to one hour [20:59:37] and you get an idea of the business of the pool [20:59:46] the instance launch time is also high [21:00:03] from an ideal <40 secs up to 1,7 minutes [21:00:04] Oh [21:00:16] Yep, guess the launch time may impact [21:00:21] i filled a task for that one. 
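A reconstruction of that bash loop over the dblist (the staging path and dblist name are assumptions; --quick skips update.php's countdown prompt, and www-data matches how the update is run manually later in this log):
```
for wiki in $(cat /srv/mediawiki-staging/dblists/all.dblist); do
  sudo -H -u www-data php5 /srv/mediawiki-staging/multiversion/MWScript.php \
      update.php --wiki="$wiki" --quick
done
```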
Not sure if it is Nodepool not processing spawn requests fast enough [21:00:29] or if it is the wmflabs rate limiting the instances creation [21:00:40] Oh [21:00:54] probably nodepool [21:01:18] Yep [21:01:19] well actually I have just found what I was looking for :] [21:01:32] Oh [21:01:32] https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=21&fullscreen [21:01:51] max and min [21:01:53] that represent Nodepool internal tasks run per minute [21:02:05] Oh [21:02:08] which max out at either 7 or 8 per minutes [21:02:23] that is due to Nodepool having a rate of one task every 8 seconds [21:02:28] or 7.5 tasks per minutes [21:02:41] Oh, is there a way of lowering that? [21:02:52] and the plateau on that graph exactly correlate with the spikes of instance launch time [21:02:58] yep [21:03:15] Yay hasharAway https://gerrit.wikimedia.org/r/#/c/313308/ now passes [21:03:16] :) [21:03:17] thcipriani: nodepool rate limit has an impact on time to spawn an instance. facts / good reading above ^^ :D [21:04:03] Oh [21:06:25] maybe there's a happier medium between the current rate and what a rate of 1 was doing [21:07:01] (03PS2) 10Paladox: Replace derecated logrotate with build-discarder [integration/config] - 10https://gerrit.wikimedia.org/r/313306 [21:07:53] 1 to 8 is a pretty drastic change. Maybe we could change it to 4? [21:08:53] even a rate of 6 would give us 10 tasks per minute [21:09:38] +1 for 4 :) [21:13:00] it's a balancing act, of course. not just to make sure we're not sending the api more traffic than it can respond to, but to ensure we're being good labs users in general. [21:13:50] 10Continuous-Integration-Infrastructure, 05Continuous-Integration-Scaling, 07Nodepool: Investigate why Nodepool instances are sometime slow to reach READY state - https://phabricator.wikimedia.org/T146813#2675635 (10hashar) So that is definitely due to Nodepool having a rate limit to the OpenStack API. Most... [21:14:14] thcipriani: I am not sure what were the issue with the rate of 1 though [21:14:40] I wonder would this [21:14:40] https://github.com/openstack-infra/nodepool/commit/7682a73c6f4bce39c04d72cd653481223a1ab0b1 [21:14:44] improve things for us? [21:15:35] (03PS12) 10Awight: Use composer in DonationInterface hhvm tests [integration/config] - 10https://gerrit.wikimedia.org/r/301025 (https://phabricator.wikimedia.org/T141309) [21:15:56] paladox: not related [21:16:03] Ok [21:16:10] hasharAway: when the rate was at 1 after the liberty upgrade nodepool kept getting 403s because openstack doesn't have a chance to update the quota after a delete requests happens [21:16:23] at least that was my understanding when we updated that value. [21:16:45] paladox: they have a Nodepool which is just too busy and starve on threads/too many stuff to do. So I guess that commit is to let them have two nodepool to work in parallel [21:16:54] Oh [21:17:07] thcipriani: yeah. I regret having OKed the liberty upgrade [21:17:08] I wonder would two nodepool's work for us [21:17:24] thcipriani: I should have been more careful and said: lets wait for me to come back since Nodepool will surely be impacted :] [21:17:48] then the API would just return a 403 / quota exceeded. 
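For reference, the knob in question, sketched from memory rather than the deployed config:
```
# provider section of /etc/nodepool/nodepool.yaml (layout assumed):
#   providers:
#     - name: wmflabs-eqiad
#       rate: 8.0   # seconds between OpenStack API tasks => ~7.5 tasks/min
# dropping rate to 4.0 would allow ~15 tasks/min, the middle ground discussed here
```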
So even with a request every second, I dont think it was much overcrowed :] [21:18:11] !log https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ is broken for unkwnon reason :( [21:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:19:42] stalled on php5 /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=commonswiki --quick [21:19:43] bah [21:19:57] yeah, nodepool was none too happy after the upgrade. If there's reason to believe that the 403s could be mitigated and the rate restored then I think we should try those adjustments. [21:20:26] the quota was off [21:20:41] but it is now being updated on each instance deletion/creation [21:20:46] indeed, there were a few instance that were hidden that were eating quota [21:20:51] unless the update is less than 30 secs iirc [21:20:53] and that fixed it [21:21:22] I guess openstack somehow lost track of instances during the upgrade [21:21:41] most probably because nodepool was still acting on the API while different components were upgraded at different time [21:21:50] I have no idea really. I am just speculating :] [21:23:08] could we try to lower the rate a bit and see if the 403s come back? If not then, well, we have faster CI and openstack is happy, too :) [21:23:18] yeah [21:23:36] I have cced chase/andrew to the task. We will see after their offsite [21:24:40] Project beta-update-databases-eqiad build #11682: 15ABORTED in 33 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11682/ [21:25:23] !log deployment-tin: sudo -H -u www-data php5 /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=commonswiki --quick [21:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:25:36] PHP Notice: Undefined variable: wmgMFUseCentralAuthToken in /srv/mediawiki-staging/wmf-config/mobile-labs.php on line 20 [21:25:36] Notice: Undefined variable: wmgMFUseCentralAuthToken in /srv/mediawiki-staging/wmf-config/mobile-labs.php on line 20 [21:25:36] bah [21:26:19] are they in production? [21:26:24] Project beta-scap-eqiad build #122034: 04FAILURE in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122034/ [21:26:51] nope [21:27:14] scap-master-sync ci-jessie-wikimedia-161630.contintcloud.eqiad.wmflabs [21:27:16] wat. [21:27:26] is this a left over dns bit? [21:27:39] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-CentralAuth, 10MobileFrontend: beta cluster: Notice: Undefined variable: wmgMFUseCentralAuthToken in /srv/mediawiki-staging/wmf-config/mobile-labs.php on line 20 - https://phabricator.wikimedia.org/T146945#2675675 (10hashar) [21:27:59] thcipriani: yeah looks like :( [21:28:05] need a task to get rid of it [21:28:12] though I think Krenair can delete DNS entries [21:28:13] also [21:28:18] that shows something is off in scap [21:28:36] I guess it is being passed an IP and does a reverse lookup [21:28:45] and then use that reverse lookup [21:28:50] I can mess with the eqiad.wmflabs zone, yes [21:28:58] But how badly are things broken right now? [21:29:15] scap broke :( [21:29:34] $ dig +short -x 10.68.21.205 [21:29:34] ci-jessie-wikimedia-161630.contintcloud.eqiad.wmflabs. [21:29:34] deployment-tin.deployment-prep.eqiad.wmflabs. [21:29:46] hasharAway: I think I'd just remove wgMFUseCentralAuthToken [21:29:51] It's not in non -labs.php [21:29:53] hasharAway im wondering weather we should up the time it takes zuul to start a job? 
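The symptom above is one IP answering with two PTR records; a quick audit across the deploy targets might look like this (sketch; assumes one hostname per line in the dsh group file):
```
while read -r host; do
  case "$host" in ''|'#'*) continue ;; esac   # skip blanks/comments
  ip=$(dig +short "$host" | head -n1)
  [ -z "$ip" ] && continue
  ptrs=$(dig +short -x "$ip" | wc -l)
  [ "$ptrs" -gt 1 ] && echo "$host ($ip) has $ptrs PTR records"
done < /etc/dsh/group/mediawiki-installation
```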
[21:29:59] No sign of it in IS [21:30:02] Would that make nodepool a little more reliable [21:30:05] or not? [21:30:10] maybe like 5 to 6 or 7 [21:30:24] Reedy: I have no idea what that variable is for. Maybe it is just a left over from an experimentation [21:31:03] paladox: I dont see how it would help [21:31:23] 21.205... hmm [21:31:28] hasharAway if it takes a little longer to start then nodepool may have a chance to catchup [21:31:28] Update 10.68.21.205 to remove [u'ci-jessie-wikimedia-161630.contintcloud.eqiad.wmflabs.'] [21:31:29] yeah [21:31:31] with the tests [21:33:03] also the 10.in-addr.arpa. zone I guess [21:36:25] Project beta-scap-eqiad build #122035: 04STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122035/ [21:38:39] 10Beta-Cluster-Infrastructure: Beta cluster database update job is stall, because of Swift? - https://phabricator.wikimedia.org/T146947#2675726 (10hashar) [21:39:06] Krenair: I thought you got rid of all those fake entries? [21:39:30] I did [21:39:52] contintcloud spawns and deletes instances like mad, and doesn't seem to care about cleaning up after itself [21:40:04] well [21:40:07] let me rephrase that :D [21:40:17] ci-jessie-wikimedia-161630 has an id which is quite old [21:40:19] months old [21:40:30] and Nodepool / contintcloud does not do anything with DNS [21:40:37] it just ask the api to spawn or delete one [21:40:40] months old? huh [21:40:46] the DNS leak being somewhere in openstack [21:41:07] yeah we are at id 384xxx now [21:41:19] I'm about to go afk, and can't look into this level of detail myself anyway [21:41:24] I'll ask ops about it next week [21:41:29] I commented on the task how much of those false PTR are from back in march-april [21:41:39] so potentially we had a DNS leak at that time which is now mostly resolved [21:41:46] My script should've cleaned up stuff from march-april [21:41:48] just giving some background :D [21:42:19] Krenair: get AFK :] And thank you very much for the PTR deletion! [21:42:32] Maybe we're actually encountering a designate bug [21:42:33] later [21:42:58] !log beta cluster update database is broken :/ Filled T146947 about it [21:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:44:02] I am disappearing [21:45:15] hasharAway you can upload mutiple images with https://github.com/openstack-infra/nodepool/commit/e078dcdef415c9998682248104d015b603ec58b1 [21:45:16] :) [21:46:30] Project beta-scap-eqiad build #122036: 04STILL FAILING in 1 min 42 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122036/ [21:48:37] !log deployment-tin: service nscd restart [21:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [21:49:38] Project beta-scap-eqiad build #122037: 04STILL FAILING in 1 min 38 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122037/ [21:50:35] huh [21:50:55] same DNS entry :D [21:51:04] I am giving up it is too late [21:51:11] my best bait is REBOOT !!!!!!!!!!!!!!!!! 
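The nscd restart above works because it drops all of nscd's caches; invalidating only the hosts table is the lighter-weight alternative:
```
sudo nscd --invalidate=hosts
```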
[21:51:28] and really [21:51:35] hasharAway i wonder if https://github.com/openstack-infra/nodepool/commit/3d5190c19aa803b20b4964e37fef9d2d10fdb9b6 fixes anything for us [21:51:40] looks like jenkins related [21:51:44] according to commit msg [21:51:45] scap should probably not reuse a random hostname that comes from the PTR [21:51:55] Project beta-scap-eqiad build #122038: 04STILL FAILING in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122038/ [21:52:20] paladox: openstack is sending dozen of commits per day. So yeah some are fixing stuff, some might be of interest. But really I tend to review them in bulk when enough are of interest [21:52:33] Oh [21:52:48] I guess we should update to see weather it does anything good for us :) [21:52:48] paladox: and we cant upgrade Nodepool for now, blocked on an externdal dependency or migrating to a different deployment system (from .Deb to scap3 maybe) [21:52:56] Oh [21:53:09] thcipriani: I am giving up :/ [21:53:14] looks like most stuff work [21:53:28] update.php is stuck for some reason. Apparently while authenticating to swift for some reason [21:53:40] beside that rest looks fine. So I am going to crash to bed [21:53:44] yeah, I don't know what's going on with scap, I'll take a look. The database thing may be beyond what I know how to fix, but I'll take a look. [21:54:00] for the ptr [21:54:05] maybe just reboot [21:54:08] which would clear caches :] [21:54:09] hasharAway we could use [21:54:11] https://pypi.python.org/pypi/shade [21:54:20] instead off [21:54:20] paladox: yeah that is the blocker actually :D [21:54:20] https://packages.debian.org/stretch/python-shade [21:54:25] yep [21:54:40] But coulden we use the one from pypi [21:54:42] ? [21:54:42] look at phabricator for the task ;] namely the challenge is figuring out a proper version of shade and package it [21:54:46] Yep [21:54:46] then try the upgade [21:54:47] foun dit [21:54:53] https://phabricator.wikimedia.org/T107267 [21:54:53] we dont use pypi [21:54:56] oh [21:55:05] PROBLEM - Puppet run on deployment-tin is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [21:55:11] you dont want to install random software out of the internet as root! [21:55:24] yep [21:55:41] so either we use a .deb package [21:55:54] I guess we could create a repo and import it from upstream [21:55:58] and create the deb from there [21:56:00] for jessie [21:56:05] or we build a binary/zip that we control , push it to a Gerrit repo and deploy that with our deploy tool (scap3) [21:56:21] Which one would you find better to do? [21:56:30] none [21:56:31] Project beta-scap-eqiad build #122039: 04STILL FAILING in 1 min 48 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122039/ [21:56:35] Oh [21:56:40] I dont want to upgrade nodepool anytime soon :] [21:56:43] Oh [21:56:49] got too much on my shoulders already [21:56:58] ok [21:57:26] * hasharAway disappears [21:59:53] Project selenium-Core » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #168: 04FAILURE in 7 min 53 sec: https://integration.wikimedia.org/ci/job/selenium-Core/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/168/ [22:01:25] What about https://github.com/openstack-infra/nodepool/commit/279f4607e9a49871275295b404fe224c47603b08 ? [22:03:21] Yippee, build fixed! 
[22:03:22] Project beta-scap-eqiad build #122040: 09FIXED in 1 min 47 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/122040/ [22:03:30] or this one [22:03:31] https://github.com/openstack-infra/nodepool/commit/2cc94909017d40f48742de99ac7eb2a3c2ee1234 [22:05:06] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [22:09:41] Project beta-update-databases-eqiad build #11683: 15ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11683/ [22:09:59] This one https://github.com/openstack-infra/nodepool/commit/e1bfe6dcb8b8540ecd3e513d725765a3d2a7f783 seems like something that will improve nodepool [22:12:16] 10Continuous-Integration-Infrastructure, 07Nodepool: Update Nodepool to catch up with upstream master branch - https://phabricator.wikimedia.org/T144601#2675837 (10Paladox) @hashar we could backport some commits that look like they improve things, without backporting everything and waiting for a while until we... [22:20:04] 10Beta-Cluster-Infrastructure: Beta cluster database update job is stall, because of Swift? - https://phabricator.wikimedia.org/T146947#2675850 (10AlexMonk-WMF) Yeah, deployment-ms-be01 is broken. I can't ssh in, or use salt against it. [22:23:07] 10Beta-Cluster-Infrastructure: Beta cluster database update job is stall, because of Swift? - https://phabricator.wikimedia.org/T146947#2675856 (10AlexMonk-WMF) Console log just shows the login prompt... I can ping it, and ssh does get part way through the login process: ``` debug3: send packet: type 5 debug3:... [22:36:07] Yippee, build fixed! [22:36:07] Project beta-update-databases-eqiad build #11684: 09FIXED in 16 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/11684/ [22:43:12] 10Beta-Cluster-Infrastructure: /srv/vdb almost full on deployment-cache-upload04 - https://phabricator.wikimedia.org/T146952#2675903 (10AlexMonk-WMF) [22:43:32] 10Beta-Cluster-Infrastructure: /srv/vdb almost full on deployment-cache-upload04 - https://phabricator.wikimedia.org/T146952#2675916 (10AlexMonk-WMF) [22:44:33] 10Beta-Cluster-Infrastructure: Beta cluster database update job is stall, because of Swift? - https://phabricator.wikimedia.org/T146947#2675918 (10AlexMonk-WMF) 05Open>03Resolved a:03AlexMonk-WMF [22:51:11] 06Release-Engineering-Team, 10Phabricator, 13Patch-For-Review, 07Wikimedia-Incident: Contention on search phabricator database creating full phabricator outages - https://phabricator.wikimedia.org/T146673#2675939 (10mmodell) [23:31:10] PROBLEM - Free space - all mounts on deployment-cache-upload04 is CRITICAL: CRITICAL: deployment-prep.deployment-cache-upload04.diskspace._srv_vdb.byte_percentfree (<100.00%) [23:56:54] !log Deleted varnish cache files on deployment-cache-upload04 to free up space, disk full [23:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master [23:57:44] no bot? [23:59:45] no, I'm just blind
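Generic triage for the /srv/vdb disk-full alert before deleting cache files by hand (standard GNU coreutils, nothing host-specific assumed):
```
df -h /srv/vdb                                      # confirm which mount is full
sudo du -xsh /srv/vdb/* 2>/dev/null | sort -rh | head   # largest consumers first
```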