[00:14:40] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.29.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T155527#3031697 (10thcipriani) [00:21:26] Yippee, build fixed! [00:21:26] Project beta-update-databases-eqiad build #15017: 09FIXED in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/15017/ [01:03:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [01:10:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [01:13:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [01:32:09] 10Gerrit, 10QuarryBot-enwiki: [Gerrit Repo Request] Request for tools.Quarrybot-enwiki - https://phabricator.wikimedia.org/T154880#3031900 (10Zppix) [02:31:56] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [02:37:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [02:40:56] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds [02:46:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 0 below the confidence bounds [03:06:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [03:57:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds [04:00:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds [04:15:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [05:03:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [05:11:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [05:16:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [06:00:21] 
Project selenium-Wikibase » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #271: 04FAILURE in 1 hr 20 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/271/ [06:10:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [06:28:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 0 below the confidence bounds [06:32:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds [06:40:03] Yippee, build fixed! [06:40:04] Project selenium-Wikibase » chrome,test,Linux,contintLabsSlave && UbuntuTrusty build #271: 09FIXED in 2 hr 0 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/271/ [07:11:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 2 below the confidence bounds [07:12:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [07:27:40] 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-InterwikiSorting, 06Operations, 10Wikidata, and 4 others: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995#2960964 (10Nikerabbit) This broke the compact language links based on comment T153900#3011037. I'm submitting... [08:34:28] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [08:41:36] 06Release-Engineering-Team (Long-Lived-Branches), 10Scap, 06Operations, 13Patch-For-Review: Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#3032403 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is completed, `use_experimental` can be removed once deploy... [09:28:10] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3032456 (10hashar) After a night of monitoring, the alarms reported on IRC and the [[ associated graph | https://grafana-admin.wikimedia.org/dashboard... [09:57:00] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3032514 (10hashar) The code is faulty. Whenever the json return a null value, python makes it a `None` and `0` is superior than `None` ``` >>> json.lo... [10:27:01] 10Browser-Tests-Infrastructure, 10Wikidata, 13Patch-For-Review, 15User-Tobi_WMDE_SW, 15User-zeljkofilipin: Increase in failures caused by Saucelabs - https://phabricator.wikimedia.org/T152963#3032562 (10hashar) From http://stackoverflow.com/questions/12787032/handling-exceptions-on-cucumber-scenarios?rq=... 
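The T70113 comment at 09:57 above pins the false alarms on Python 2 comparison semantics: Graphite returns `null` for empty buckets, `json.loads()` turns that into `None`, and under Python 2 `0 > None` is quietly `True`, so empty buckets end up counted against the confidence bounds. A minimal illustration of the problem and of the obvious fix (skip the `None` datapoints); this is not the actual check_graphite code:

```
import json

# Shape of a Graphite render?format=json series: [[value, timestamp], ...]
# Empty buckets come back as null, which json.loads() turns into None.
payload = '[[12.0, 1487204400], [null, 1487204460], [15.5, 1487204520]]'
datapoints = json.loads(payload)

def points_above(datapoints, bound):
    """Count datapoints strictly above `bound`, ignoring empty (None) buckets."""
    # Under Python 2 a None value still compares against numbers (e.g. `0 > None`
    # is True) instead of raising, so null buckets silently take part in the
    # bound checks; the fix is simply to skip them before comparing.
    return sum(1 for value, _ts in datapoints
               if value is not None and value > bound)

print(points_above(datapoints, 10))  # prints 2 (the null bucket is not counted)
```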
[10:38:45] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3032570 (10hashar) Hacking the script a bit ``` + for upperconf in result[2]['datapoints']: + print time.strftime("| %Y-%m-%d %H:%... [11:35:41] hashar: zeljkof: do you have any clue what happened to this job? [11:35:43] https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/271/console [11:35:58] 06:00:21 /srv/jenkins-workspace/workspace/selenium-Wikibase/BROWSER/chrome/MEDIAWIKI_ENVIRONMENT/beta/PLATFORM/Linux/label/contintLabsSlave && UbuntuTrusty does not exist. [11:36:04] doesn't soung good [11:36:22] it broke in the middle of the run [11:36:47] ouch [11:36:53] all sorts of errors [11:36:55] from No such file or directory - getcwd (Errno::ENOENT) [11:37:01] yes [11:37:02] to a to of java ones [11:37:07] strange.. hope this was a one-timer [11:37:12] probably [11:37:19] or something is seriously broken :) [11:37:32] please report if it happens again [11:37:41] rerun the job now [11:37:50] zeljkof: thx! [11:37:52] will do [12:03:52] 10Gerrit, 10UI-Standardization: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3032714 (10Ladsgroup) [12:07:40] 10Gerrit, 10UI-Standardization: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3032731 (10Ladsgroup) | Type | Before | After | Post button| {F5654410} |{F5654418} [12:35:44] 10Browser-Tests-Infrastructure, 10Wikidata, 13Patch-For-Review, 15User-Tobi_WMDE_SW, 15User-zeljkofilipin: Increase in failures caused by Saucelabs - https://phabricator.wikimedia.org/T152963#3032804 (10zeljkofilipin) Sauce labs support ticket (not public): https://support.saucelabs.com/hc/en-us/requests... [12:37:34] Tobi_WMDE_SW: zeljkof: pretty sure that is because of the spaces in the label selector [12:37:51] in matrix jobs, Jenkins uses the label to forge the workspace directory [12:38:03] so given it has an axis value of label="contintLabsSlave && UbuntuTrusty" [12:38:09] it creates a dir such as ...../contintLabsSlave && UbuntuTrusty/ [12:38:19] and at some point we have a script that must be doing something like: [12:38:21] cd $WORKSPACE [12:38:23] which fails [12:38:40] because of the whitespace, the shell ends up tying to do: [12:38:56] hashar: but has something changed recently? I've never seen such error before.. [12:38:59] cd ...label/contintLabsSlave && UbuntuTrusty [12:39:13] or it tries to cd to a non existing path and run the UbuntuTrusty command [12:39:34] something got changed somewhere I guess :/ [12:39:59] hmm [12:40:01] in Junit apparently [12:40:05] the stacktrace is java [12:40:09] hm [12:40:15] but it works for the other job? [12:40:25] runs fine for test, but fails for beta? [12:40:29] 01:20:07.564 at org.apache.tools.ant.types.AbstractFileSet.getDirectoryScanner(AbstractFileSet.java:460) [12:40:31] 01:20:07.564 at hudson.tasks.junit.JUnitParser$ParseResultCallable.invoke(JUnitParser.java:127) [12:40:39] /srv/jenkins-workspace/workspace/selenium-Wikibase/BROWSER/chrome/MEDIAWIKI_ENVIRONMENT/beta/PLATFORM/Linux/label/contintLabsSlave && UbuntuTrusty does not exist. 
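hashar's working hypothesis at 12:38 is worth spelling out, even though he discounts it a few minutes later once the workspace turns out to have been deleted: the matrix axis value `contintLabsSlave && UbuntuTrusty` becomes part of the workspace path, and any build step that pastes that path unquoted into a shell command hands the shell a literal `&&`. A small, self-contained demonstration of the mechanism with hypothetical paths, not the real Jenkins job:

```
import os
import shlex
import subprocess
import tempfile

# Hypothetical recreation of the matrix workspace layout; only the last path
# component matters, since it contains the label expression, spaces and all.
base = tempfile.mkdtemp()
workspace = os.path.join(base, 'label', 'contintLabsSlave && UbuntuTrusty')
os.makedirs(workspace)

# Broken: textual interpolation.  The shell splits the command at "&&", so cd
# only receives ".../label/contintLabsSlave" (which does not exist) and
# "UbuntuTrusty" would be parsed as a separate command.
print(subprocess.run(['sh', '-c', 'cd %s && pwd' % workspace]).returncode)

# Safe: quote the expansion so the whole path stays a single word.
print(subprocess.run(['sh', '-c', 'cd %s && pwd' % shlex.quote(workspace)]).returncode)
```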
[12:41:29] or the path really does not exist [12:41:38] but really no [12:42:39] there is also an error from water require No such file or directory - getcwd (Errno::ENOENT) [12:42:59] maybe the workspace has been deleted magically ? :( [12:43:45] integration-slave-trusty-1003:~$ cd /srv/jenkins-workspace/workspace/selenium-Wikibase/ [12:43:45] -bash: cd: /srv/jenkins-workspace/workspace/selenium-Wikibase/: No such file or directory [12:43:54] did anybody rerun the job to see if it happened just once for mysterious reasons? or if it is consistent [12:44:29] looks like the workspace has been deleted somehow [12:48:22] I blame jenkins trolls ;P [12:51:07] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3032844 (10hashar) https://gerrit.wikimedia.org/r/#/c/338095/ lets one specific to check_graphite a minimum value the Holt-Winters confidence upper ba... [12:51:43] Tobi_WMDE_SW: one off error. The workspace /srv/jenkins-workspace/workspace/selenium-Wikibase/ has somehow been deleted while the job was running, and honestly I have no idea what might have happened [12:51:55] puppet log doesn't show anything suspicious [12:52:01] nobody logged on the instance [12:52:05] so it is a complete mystery to me [12:52:16] * zeljkof is out of lunch [12:52:32] * zeljkof blames jenkins, for realz [12:53:55] rebuilding the job [13:24:05] * zeljkof is back [13:33:10] there adding wip to gerrit core now :) [13:47:04] Yippee, build fixed! [13:47:05] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #308: 09FIXED in 3 min 3 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/308/ [13:49:09] PROBLEM - Puppet run on buildlog is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [13:53:40] hashar: ohh.. ok. :) strange [13:53:59] some trolls living inside jenkins. :) [13:56:45] that's what I have been saying for years... [13:57:00] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [14:06:25] zeljkof: :D [14:44:42] (03PS2) 10Hashar: (WIP) Wikibase jobs on Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/333253 [14:49:00] (03PS3) 10Hashar: Wikibase experimental jobs on Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/333253 [14:53:20] (03CR) 10Hashar: [C: 032] Wikibase experimental jobs on Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/333253 (owner: 10Hashar) [14:53:34] zeljkof: will try Wikibase on Nodepool :) [14:53:47] hashar: yeah! 
[14:53:50] I wasn't sure how to test it [14:53:57] but experimental pipeline is exactly meant for that [14:54:13] did a bunch of copy pasta though [14:54:19] (03Merged) 10jenkins-bot: Wikibase experimental jobs on Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/333253 (owner: 10Hashar) [14:55:45] bah [14:55:53] there are so many jobs in the experimental pipeline [14:57:34] notice that for a few repos [14:57:46] it is a bit out of control [14:58:14] * hashar watches https://integration.wikimedia.org/ci/job/mwext-Wikibase-client-tests-mysql-hhvm-jessie/2/console [15:01:59] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:23] hmm [15:02:25] most are success [15:02:27] that is suspect [15:03:09] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 07Ruby, 15User-zeljkofilipin: Selenium tests broken on Ruby 2.4 - https://phabricator.wikimedia.org/T157695#3033158 (10zeljkofilipin) [15:03:12] 10Browser-Tests-Infrastructure, 07Ruby, 15User-zeljkofilipin: Update Ruby tests to Selenium 3 - https://phabricator.wikimedia.org/T158074#3033157 (10zeljkofilipin) [15:07:29] (03PS2) 10Zfilipin: WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) [15:09:37] (03PS3) 10Zfilipin: WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) [15:11:27] Yippee, build fixed! [15:11:27] Project selenium-Wikibase » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #272: 09FIXED in 2 hr 17 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/272/ [15:14:07] (03CR) 10jerkins-bot: [V: 04-1] WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) (owner: 10Zfilipin) [15:14:37] (03PS1) 10Hashar: Wikibase jobs to Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/338121 [15:17:16] (03CR) 10Hashar: [C: 032] Wikibase jobs to Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/338121 (owner: 10Hashar) [15:21:53] zeljkof: and I think I will create a few slaves dedicated to running the browser tests jobs [15:22:09] not sure whether the related puppet class manages to populate everything we need though. [15:22:14] I will give it a try eventually [15:22:43] hashar: great! [15:22:55] from a random discussion with Tyler last week [15:23:06] I think I will set up a standalone Jenkins master for the browser tests jobs [15:23:17] would be easier to manage than having a single jenkins with everything [15:23:29] (03Merged) 10jenkins-bot: Wikibase jobs to Nodepool [integration/config] - 10https://gerrit.wikimedia.org/r/338121 (owner: 10Hashar) [15:23:30] not sure yet how to provision a jenkins master though [15:23:39] Wikibase jobs are on Nodepool now :) [15:25:42] hashar: would you do me a favor and figure out the ratio of jessie vs trusty jobs over the last few weeks on nodepool? [15:27:01] chasemp: sure thing [15:27:49] chasemp: for the Nodepool instance in number of builds per day over 90 days : https://grafana.wikimedia.org/dashboard/db/nodepool-migration?panelId=26&fullscreen [15:28:00] the tooltip in grafana is wrong [15:28:09] but roughly 80% Jessie 20% Trusty [15:28:21] can we turn that into an aggregate pie graph? 
(graphite can not sure if grafana) [15:28:25] I stopped triggering PHP5 jobs last week iirc [15:28:32] they only run on CR+2 nowadays [15:28:49] seems like a solid trend there [15:29:12] yeah busy busy :} [15:29:23] I want to get rid of Trusty entirely eventually [15:29:30] let's adjust the ratio of ready nodes this week? [15:30:11] o/ does anyone know if renaming a Wikimedia GitHub repo would break mirroring? I believe GitHub redirects old URL usages but I wondered if anyone knew [15:30:39] chasemp: ready is 6 jessie / 3 trusty. Guess you can turn trusty down to 2 [15:30:54] (03PS4) 10Zfilipin: WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) [15:30:58] and maybe bump the # of ready jessie a little. It doesn't hurt to have instances already ready [15:31:11] I think we should up jessie too yeah [15:31:13] considering [15:31:23] niedzielski: just rename it, at worth we will notice in the Gerrit replication logs and adjust [15:31:25] I don't want to do this today in case [15:31:33] but possibly tomorrow (andrew is out atm) [15:31:48] niedzielski: I really don't know what happens. Cause really the Gerrit repo is mapped to a github project name in a deterministic way [15:31:59] niedzielski: so if you remove the GitHub repo, you would really want to first rename the Gerrit repo [15:32:15] niedzielski: eg in Gerrit: foo/bar is hardcoded to become in github wikimedia/foo-bar [15:32:31] if you rename in github foo-bar to barfoohipxx [15:32:37] gerrit will NOT replicate [15:32:56] well it will, but to the other name. And no idea how github handles that [15:33:04] chasemp: it is harmless really [15:33:20] nodepool even pick the config on each tick [15:33:37] I prefer to wait as I don't have to time to debug unintended consequences today [15:34:19] (03CR) 10jerkins-bot: [V: 04-1] WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) (owner: 10Zfilipin) [15:35:48] hashar: thanks! [15:37:03] chasemp: you are too paranoid :-} But fair we can do it tomorrow [15:37:11] I added some pie chart on https://grafana.wikimedia.org/dashboard/db/nodepool-migration [15:37:23] but I have no idea what the % represent [15:37:31] "could not draw pie with labels caontained inside canvas" [15:37:55] ah that is the current ratio [15:38:39] (03PS5) 10Zfilipin: WIP Update Ruby tests to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) [15:39:54] chasemp: also what would you need to bump the nodepool quota further up? [15:40:16] we had some noticeable latency over the last couple weeks. 
devs are buys [15:40:18] busy [15:41:41] the best metric we have in a vacuum there is concurrent used it seems [15:42:23] and if that matches (allocated - 2) iirc [15:42:29] to indicate a lack of available pool [15:43:08] I'll think about how to quantify it when I can, so far we hae looked at wait or length times for tests historically [15:43:14] that has been shown to suck imo [15:43:29] zuul and gearman have their own logic that effects it indepdenently, and tests can take too long to begin with [15:43:52] ratio of pool availability vs used I guess is the sanest thing atm [15:44:06] but yeah, let's bump it up [15:45:16] 10Beta-Cluster-Infrastructure, 10ORES, 10Revision-Scoring-As-A-Service-Backlog: On beta cluster, ORESFetchScoreJob got a HTTP 400 bad request from ores-beta - https://phabricator.wikimedia.org/T157790#3016510 (10Halfak) p:05Triage>03Normal [15:47:41] chasemp: I will fill a task with some updated quota. Guess we can bump it by 20 % ? [15:49:54] (03PS6) 10Zfilipin: Upgrade to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) [15:50:30] is that to 25? [15:50:39] I don't recall what it is now since there are like 3 numbers [15:51:08] have to check w/ andrew etc but something in there, we'll get back some mojo just by reducing ready trusty jobs since those are reducing over time [15:51:15] and yet consuming held ready slots [15:51:25] if you could kick off that task that would be cool [15:53:57] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 07Ruby, 15User-zeljkofilipin: Selenium tests broken on Ruby 2.4 - https://phabricator.wikimedia.org/T157695#3033310 (10zeljkofilipin) Upgrade to Selenium 3 was easy: https://gerrit.wikimedia.org/r/#/c/336824/ [15:59:30] chasemp: I am crafting a table with various bumps [15:59:36] with all the quota needed [15:59:47] yours to pick one based on what the infra can support [15:59:48] can we keep it simple with one request? [15:59:57] and later on we can just reuse that task and pick another # instance [16:00:04] no, I don't want to do that at all [16:00:10] ha :) [16:00:26] it throws off our tracking of quota bumps [16:00:34] ok [16:00:35] and makes it hard to reason on what was done and when for how we review things [16:00:55] better to have one task for one bump in quota with one rational that makes sense as of now [16:00:56] should I set a request to go from 19 instances to 25 ? [16:01:00] sure [16:01:09] filling bits in the table :) [16:13:44] 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure: Nodepool quota bump - https://phabricator.wikimedia.org/T158320#3033347 (10hashar) [16:14:41] 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure: Nodepool quota bump - https://phabricator.wikimedia.org/T158320#3033347 (10hashar) [16:14:48] chasemp: done and made it a sub task of the tracking task for quota increase [16:23:32] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 07Ruby, 15User-zeljkofilipin: Selenium tests broken on Ruby 2.4 - https://phabricator.wikimedia.org/T157695#3033442 (10zeljkofilipin) 05Open>03Resolved Update to Selenium 3 fixes tests on Ruby 2.4. [16:25:33] (03CR) 10Zfilipin: "This also fixes T157695" [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) (owner: 10Zfilipin) [16:26:52] (03CR) 10Zfilipin: "All mediawiki/selenium tests pass. Also, all mediawiki/core selenium tests pass when using this patch. I think it is safe to merge it. 
Aft" [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T158074) (owner: 10Zfilipin) [16:27:52] (03PS7) 10Zfilipin: Upgrade to Selenium 3 [selenium] - 10https://gerrit.wikimedia.org/r/336824 (https://phabricator.wikimedia.org/T157695) [16:30:56] !log integration: created browsertests-1001 intended to run the daily browser tests later on [16:30:59] zeljkof: ^^^:) [16:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:31:10] will provision it and try to run tests on it [16:31:12] hashar: nice [16:31:28] so the selenium-* jobs will run on dedicated instances [16:33:45] !log deploying ores:e9bbda3 [16:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:33:58] 10Browser-Tests-Infrastructure, 07Ruby, 15User-zeljkofilipin: Move rake and rubocop dependency from repositories to mediawiki/selenium - https://phabricator.wikimedia.org/T158326#3033457 (10zeljkofilipin) [16:34:53] https://integration.wikimedia.org/ci/computer/browsertests-1001/ but not provisioned yet [16:41:07] 10Browser-Tests-Infrastructure, 07Ruby, 15User-zeljkofilipin: Move rake and rubocop dependency from repositories to mediawiki/selenium - https://phabricator.wikimedia.org/T158326#3033508 (10zeljkofilipin) p:05Triage>03Low [16:41:45] 10Browser-Tests-Infrastructure, 07Ruby, 15User-zeljkofilipin: Move Rake and RuboCop dependency from repositories to mediawiki/selenium - https://phabricator.wikimedia.org/T158326#3033457 (10zeljkofilipin) [16:43:05] (03PS1) 10Zfilipin: Move Rake and RuboCop dependency from repositories to mediawiki/selenium [selenium] - 10https://gerrit.wikimedia.org/r/338137 (https://phabricator.wikimedia.org/T158326) [16:53:04] zeljkof: in the selenium-* jobs we never target a localhost mediawiki install do we ? [16:53:23] I don't think so [16:53:33] it's beta cluster mostly [16:53:41] sometimes testwiki or something like that [16:54:37] and do they use a local web driver / chromedriver [16:54:43] or are they all using saucelabs? [16:54:55] I populated the new instance browsertests-1001 [16:55:16] the role class installs xvfb / chromedriver / redis [16:55:22] all saucelabs [16:55:23] I guess we just need ruby :} [16:55:28] bah [16:55:34] guess I will create yet another class [16:55:36] yes, you do not even need xvfb [16:55:38] or redis [16:55:39] and maybe name them saucelabs [16:56:14] !log integration: provisioned browsertests-1001 with role::ci::slaves::browsertests . Added it to Jenkins with label BrowserTests [16:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [16:56:23] guess tomorrow I can try to run a random job on it [17:00:02] lets see https://integration.wikimedia.org/ci/view/Selenium/job/selenium-Core-hashar-browsertests-1001/2/console [17:08:29] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: zuul-merger git-daemon process is not start properly by systemd ? - https://phabricator.wikimedia.org/T157785#3033631 (10hashar) Patch failed with: Error: Failed to apply catalog: Parameter provider failed on Servic... 
[17:09:54] Project selenium-Core-hashar-browsertests-1001 » firefox,beta,Linux,BrowserTests build #2: 04FAILURE in 10 min: https://integration.wikimedia.org/ci/job/selenium-Core-hashar-browsertests-1001/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/2/ [17:14:09] greg-g: Once the security review is done, I think https://phabricator.wikimedia.org/T132058 is going to be ready to go, just a heads up [17:14:20] I'll ping you again about scheduling [17:16:39] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: zuul-merger git-daemon process is not start properly by systemd ? - https://phabricator.wikimedia.org/T157785#3033657 (10Paladox) @hashar sysvinit is not a valid provider in service puppet code, see https://docs.puppet.c... [17:16:48] marktraceur: is it on beta yet? (I assume not since no security review) [17:16:54] marktraceur: but, cool! [17:20:03] greg-g: Not on beta yet, that's what we need to schedule :) [17:20:25] greg-g: The schedule on the security review ticket says deploy to beta first week of March, then push to testwikis, then to Commons. [17:21:55] gotcha, yeah, go to beta anytime post positive security review :) [17:22:03] Done. [17:39:02] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Make it possible to execute tests as a specific (new) MediaWiki user on beta cluster - https://phabricator.wikimedia.org/T152432#3033747 (10zeljkofilipin) After some testing, looks like it is already supported. Feature: ```lang=gherk... [17:44:29] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Make it possible to execute tests as a specific (new) MediaWiki user on beta cluster - https://phabricator.wikimedia.org/T152432#3033803 (10zeljkofilipin) 338146 is an example patch. Test it with: ```lang=sh $ git review -d 338146 $... [17:46:55] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: zuul-merger git-daemon process is not start properly by systemd ? - https://phabricator.wikimedia.org/T157785#3033813 (10Paladox) heres a systemd file for git-daemon http://pkgs.fedoraproject.org/cgit/rpms/git.git/commit... [17:47:09] 10Browser-Tests-Infrastructure, 13Patch-For-Review, 15User-zeljkofilipin: Make it possible to execute tests as a specific (new) MediaWiki user on beta cluster - https://phabricator.wikimedia.org/T152432#3033817 (10zeljkofilipin) a:05zeljkofilipin>03None As far as I can see, all you need is already implem... [18:27:10] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [18:28:42] tests doint look like there working on https://integration.wikimedia.org/zuul/ [18:29:16] nodepool looks down. [18:29:21] chasemp ^^ [18:30:40] Not sure who else to ping for https://integration.wikimedia.org/zuul/ as hashar is not online. [18:31:40] there are no nodepool instances listed in Jenkin's view: https://integration.wikimedia.org/ci/ [18:32:06] hrm, damn, they're all in "delete" state [18:32:07] Do we have a nice way yet to run mw maintenance scripts from outside the wmf tree? 
[18:32:08] !log no active nodepool instances listed in Jenkin's view: https://integration.wikimedia.org/ci/ but zuul has plenty to do https://integration.wikimedia.org/zuul/ [18:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:32:58] Wasn't something changed about nodepool instances earlier? [18:33:09] Reedy today? [18:33:10] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [18:34:22] hrm, a few are building now [18:34:33] I restarted nodepool [18:34:35] what happened? weird [18:34:36] the daemon had crashed [18:34:40] ahhhh, thanks chase [18:34:51] !log chase restarted nodepool, the daemon crashed [18:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:35:11] | a2ad7d7d-d006-434e-beb5-83378e93d992 | ci-jessie-wikimedia-530864 | BUILD | | [18:35:11] | 0e15ad62-3fea-4479-825a-23dcf3d574f9 | ci-jessie-wikimedia-530863 | ERROR | | [18:35:11] | 41d0babe-cd6f-4fbf-9441-9257cbe99804 | ci-jessie-wikimedia-530862 | ERROR | | [18:35:14] | f42e525e-e6d5-4291-a87d-26cfe9463136 | ci-jessie-wikimedia-530861 | ERROR | | [18:35:15] | e48dcd97-eb7e-444c-844b-be544170b8c3 | ci-jessie-wikimedia-530860 | ERROR | | [18:35:17] | fbfec1ea-e086-47b5-b864-4e2afa44d951 | ci-trusty-wikimedia-530859 | ERROR | | [18:35:19] | a4679f6f-c735-4701-bff7-73243be83893 | ci-trusty-wikimedia-530858 | ERROR | | [18:35:21] | df3cd2d8-3b07-4a7e-a18f-f7bcf5c8d864 | ci-trusty-wikimedia-530857 | ERROR | [18:35:47] that image is bad or something [18:37:00] hrm, that's 2 images, we've had them both for 4 hours [18:41:06] when I look at openstack server show name I see a fault of: {u'message': u'[Errno 28] No space left on device', u'code': 500, u'created': u'2017-02-16T18:35:45Z'} [18:41:23] hm [18:41:36] I am going to stop nodepool and do a cleanup here [18:41:49] and see if this is an issue it got itself into that it is transient or persists [18:41:51] objection? [18:41:53] I don't know what device that's referring to, labnodepool1001 is fine disk-space wise [18:41:57] no objection [18:42:00] go for it [18:44:33] thcipriani: I don't really grok what's going on here so we may pull some things off nodepool for a bit or consider it bypassable? [18:44:37] this is something seriously weird to me [18:45:34] chasemp: you mean we need to move CI stuff off nodepool? [18:45:59] usually when something is up legoktm or hashar will do some amount of that if needed iirc [18:46:00] didn't parse your last statement [18:46:02] just asking the question [18:46:18] or at least saying, you guys should put a notice wherever ppl look to see if CI is borked [18:47:09] gotcha, ok [18:47:59] 10Continuous-Integration-Config, 10Page-Previews, 06Reading-Web-Backlog, 07Browser-Tests, and 2 others: add rake entrypoint and rubocop to ext:Popups - https://phabricator.wikimedia.org/T136285#3034055 (10Jdlrobson) [18:48:22] 10Continuous-Integration-Config, 10Page-Previews, 06Reading-Web-Backlog, 07Browser-Tests, and 2 others: add rake entrypoint and rubocop to ext:Popups - https://phabricator.wikimedia.org/T136285#2329626 (10Jdlrobson) FYI we might consider moving our browser tests to the node stack and not do this. 
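For reference, the fault chasemp reads off at 18:41 comes straight from the Nova API. A small helper in the spirit of what he does by hand could walk every ERROR'd instance and print its fault; this assumes the `openstack` CLI with JSON output and project credentials in the environment, and the exact rendering of the `fault` field varies by client version:

```
import json
import subprocess

def openstack_json(*args):
    """Run an openstack CLI command and parse its JSON output."""
    out = subprocess.check_output(('openstack',) + args + ('-f', 'json'))
    return json.loads(out)

for server in openstack_json('server', 'list', '--status', 'ERROR'):
    detail = openstack_json('server', 'show', server['ID'])
    # For instances in ERROR state Nova exposes a "fault" property; whether it
    # arrives as a nested object or a repr string depends on the client version.
    print('%s: %s' % (server['Name'], detail.get('fault', '<no fault recorded>')))
```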
[18:48:26] 10MediaWiki-Releasing, 10MediaWiki-Containers, 06Services, 15User-mobrovac: Ready-to-use Docker package for MediaWiki - https://phabricator.wikimedia.org/T92826#1121273 (10Dan.mulholland) I think this is a great initiative and wish I'd noticed it earlier. I've spent a few years deploying MW and find that... [18:48:28] RainbowSprinkles: could you change the topic here to say CI is broken for the moment? [18:48:33] * thcipriani bad at irc [18:48:50] I'm going to dig into how to move service back over to permanent slaves :( [18:49:24] thcipriani: Channel is +t, so you have to op first [18:49:27] thcipriani: give me another minute tho [18:49:30] I may be figuring something out [18:49:48] chasemp: ok, lemme know [18:50:37] RainbowSprinkles: thank you :) [18:51:02] PROBLEM - Puppet run on deployment-apertium02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:52:55] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:53:50] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [18:54:50] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [18:56:38] thcipriani: weird errors in nodepool log [18:56:41] someting somethign db connection? [18:56:54] can you help me look? I don't entirely grok nodepool's deal here [18:57:02] in the debug log? [18:57:09] Feb 16 18:56:43 labnodepool1001 nodepoold[7991]: ManagerStoppedException: Manager contint1001 is no longer running [18:57:16] ah I was looking in systemd logs [18:57:43] hrm. I saw something happen with contint1001/2 this morning... [19:04:56] thcipriani: has anything changed w/ nodepool lately? [19:05:16] not that I'm aware of [19:06:06] I mean, it /seems/ like openstack is bouncing it back w/ a legit error [19:06:15] and nodepool is trying to consume outside of quota [19:06:16] antoine might be call-able now, if you need him [19:06:19] | fixed-ips | 200 [19:06:20] yes [19:06:21] yet [19:06:22] ^ [19:06:36] Failed to allocate the network(s) with error Maximum number of fixed ips exceeded [19:06:48] (yes to me?) [19:07:07] greg-g: could you call? I think there was something done with contint1001 this morning, but I'm not clear what impact it could be having (if any) [19:08:19] also: I think there may have been some kind of jenkins update, which is where this problem may come in, there may be some kind of communication problem between nodepool and jenkins [19:08:36] I mean, I see it bouncing back based on quota it shouldn't have exhausted so I gave it more fixed-ip allocation [19:08:48] and now things are moving a little...I think [19:09:28] I do see a number in "building" [19:09:34] Reedy: can you call antoine again, my phone isn't connecting :/ [19:09:35] whether they are getting used I don't know [19:09:48] what I did to fix was:openstack quota set contintcloud --fixed-ips 250 [19:09:49] from 200 [19:09:58] greg-g: nodepool still? [19:10:05] Reedy: yeah [19:10:10] they're being used: https://integration.wikimedia.org/ci/ [19:10:16] yeah, they seem to be getting used... 
[19:10:19] 4 right now [19:10:26] sure but it thinks it ate 200 [19:10:41] and when I cleared out nodepool and stopped it it came back w/ a whoel list of "ghost" instances [19:10:46] or instances I know didn't actually exist etc [19:10:58] not according to https://grafana.wikimedia.org/dashboard/db/nodepool [19:11:01] so bumping from 200 to 250 while sane, dosn't make sense [19:11:16] 19 in that pie-chart [19:11:18] I'm going off of 'nodepool list' on teh server [19:11:23] yeah [19:11:26] * greg-g nods [19:11:30] * greg-g shrugs [19:11:32] I def didn't see 200 and can't for the life of me imagine how that's possible [19:11:37] no kidding [19:11:40] but as soon as I upped that specific quota [19:11:42] things started churning [19:11:45] that's odd [19:12:04] yeah, nodepool maintains it's own database of instances, hence the ghost list [19:12:10] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 0 below the confidence bounds [19:12:28] hm [19:12:32] why it was showing instances as unable to delete: I don't know. I tried manually deleting one and it went no where. [19:12:46] I know why, logs had literally filled up on labnet1001 [19:13:00] and it was so nicely failing to execute the operation as it couldn't log it [19:13:16] :/ [19:13:20] the new novaobserver stuff hits so often (it seems) with legit requests it's filling teh logs fast [19:13:40] sorry I didn't mention that [19:13:55] belated log of it [19:14:04] I need to go afk for about 30 minutes, sorry [19:14:05] greg-g: "Was at a meeting. Heading back home, will be there at 20:30" [19:14:16] I dunno if he means 20:30 his or my time [19:14:21] either 15 minutes or 1:15 [19:14:24] So that's either 16 minutes, or 76 mins [19:14:25] Yeah [19:14:31] let's hope 15 :) [19:14:31] I think things are going now thcipriani greg-g [19:14:50] yeah, but... why? [19:14:52] nodepool does seem happier [19:14:57] I can't explain why nodepool supposedly ate up all teh fixed-ip quota atm [19:15:00] I have the same question as to hwy [19:15:30] chasemp: thanks for the quick restart and ip pool bump to get it back going [19:15:39] so openstack thinks we've used 200 ips? [19:16:13] I think there is a condition where resources in flight are removed from teh pool even if failed eventually [19:16:14] like [19:16:25] labnet1001 logs fill up and nodepool keeps scheduling instances [19:16:33] and they are draining the resource pool while not actually completing [19:17:02] and eventually it basically consumes all quota though no effect as you almost have to allocate quota before it's materialized to prevent race conditions on quota usage [19:18:04] * greg-g goes and will return [19:18:56] thcipriani: can you confirm things seem to be off and running? [19:19:25] chasemp: I deep re-reading the last thing you typed, and I don't quite think I understand what you're saying. We consumed all the quota, so we had to allocate new quota to get to the point where we can deallocate quota? [19:19:44] chasemp: yeah, things seem to be working -- thanks for the fix! 
[19:20:17] I mean something like, instances were not being generated but requests for isntances kept coming [19:20:26] and those requests each took a slot off the quota pool [19:20:33] even if they were never actually fulfilled [19:20:35] the other thing to know is [19:20:46] openstack is terrible about this and keeps quota stored separately from the actual usage [19:20:51] it doesn't tally instances to find instance quota [19:20:59] it keeps a record that it updates periodically as it does operations [19:21:08] so quota belief can be != reality [19:22:05] so you had to manually square the ip quota pool? [19:22:14] or will have to? [19:22:45] or, I guess, you said it will square itself on a long enough timeline? [19:22:50] it should update now that it's going automatically, this is a change I made awhile ago to tell it to update every n [19:23:04] /should/ [19:23:56] so, in theory, if we just waited n it would have squared the quota and started working again? Provided the labnet machine being full was taken care of in that interval? [19:24:03] 10Continuous-Integration-Infrastructure: Cannot access the database: Access denied for user 'jenkins_u0'@'127.0.0.1' to database 'closedwikis' - https://phabricator.wikimedia.org/T157815#3034217 (10Umherirrender) Than it sounds like a similar problem as with Premoderation (T157417) where also the initial edits f... [19:24:06] # Quota drift is a common problem [19:24:06] max_age = 30 [19:24:34] nova.conf [19:25:10] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [19:26:15] and max age is the time it goes before attempting to reconcile quotas with allocations? [19:27:05] I can't remember I'm looking for what exactly that means [19:27:31] I think it's, on doing a successful operation (or maybe any operation?) if quota is older than n update it [19:27:56] RECOVERY - Puppet run on deployment-phab01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:06] hrm, that can't be since we had images in delete for like an hour [19:28:25] well, maybe that's why 1 or 2 would squeak through now and then [19:28:31] o/ [19:28:35] chasemp: Reedy: I am around [19:28:37] Sam poked me over txt [19:28:50] hashar: yeah, all seems to be working now [19:29:00] great! [19:29:10] sorry, I asked him to text since I saw something about contint1001 go through puppet this morning [19:29:13] hey hashar, two things happened probably with causality. labnet logs filled up, and nodepool went a little crazy and openstack thinks it used all it's quota for fixed-ips [19:29:35] and there is something about the nodepool manager in the nodepool logs [19:29:52] so I thought: maybe something with contin1001 puppet changes [19:30:05] thcipriani: heh [19:30:07] " If the quota is already (incorrectly) too high and exceeds the quota limit, the reservation that triggers the refresh will still fail. I.e. the reservation is attempted based on the quota usage value before the refresh." [19:30:16] chasemp: fixed ip that is the 10.68.x.x IP isnt it ? 
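chasemp's explanation from 19:16 onward boils down to this: Nova charges its own usage counter at reservation time, reality only catches up if the launch succeeds, and the counter is only re-synced when a later reservation finds it older than `max_age`. Per the doc he quotes at 19:30, though, that reservation is refused first whenever the stale counter already sits above the limit, which is why bumping the limit from 200 to 250 un-wedged things. A toy model of that interaction (not Nova's code):

```
import time

class QuotaTracker:
    """Toy model only: the real accounting lives in Nova's quota tables."""

    def __init__(self, limit, max_age=30):
        self.limit = limit
        self.max_age = max_age          # the nova.conf setting quoted at 19:24
        self.recorded_usage = 0         # Nova's belief
        self.actual_instances = 0       # reality
        self.last_refresh = time.time()

    def _maybe_refresh(self):
        # Periodic re-sync of belief with reality, piggybacked on reservations.
        if time.time() - self.last_refresh > self.max_age:
            self.recorded_usage = self.actual_instances
            self.last_refresh = time.time()

    def reserve(self, launch_succeeds=True):
        # The limit check uses the (possibly stale) recorded value before any
        # refresh, per the doc quoted at 19:30: a wedged counter therefore
        # blocks every later reservation, including the one that would fix it.
        if self.recorded_usage >= self.limit:
            raise RuntimeError('Maximum number of fixed ips exceeded')
        self._maybe_refresh()
        self.recorded_usage += 1        # charged immediately...
        if launch_succeeds:
            self.actual_instances += 1  # ...but reality only moves on success

q = QuotaTracker(limit=200)
for _ in range(250):
    try:
        # nodepool retrying against a dying API: in this sketch the failed
        # launches are never rolled back, which is roughly the suspected drift.
        q.reserve(launch_succeeds=False)
    except RuntimeError:
        break
print(q.recorded_usage, q.actual_instances)  # prints: 200 0 (belief != reality)
q.limit = 250  # the quota bump lets reservations run again, after which the
               # next max_age refresh can pull recorded_usage back to reality
```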
[19:30:20] aint that a bag of crap [19:30:24] hashar: yeah [19:30:43] that doc: ugh [19:30:58] the really blame in Nodepool is that it is not smart enough to find out the cloud provider has some transient troubles [19:31:03] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:10] nodepool keeps spamming requests no matter what so that pills up quickly [19:31:24] nodepool is a bull in a china shop in that respect yes [19:31:25] yeah, well, nodepool was designed not to care, seemingly [19:31:33] yeah [19:31:41] since it was designed to use a whole bunch of openstacks [19:31:49] though maybe there are some commits in upstream master branch that makes it nicer / more resilient [19:31:49] so if one has problems: not a big deal [19:32:06] mode: keep trying no matter what forever and ever [19:32:46] which makes sense for the original usecase [19:32:56] thcipriani: hashar in about 2 minutes I'm going to see about resetting that quota back [19:33:03] also chasemp you mentioned about having some end to end test that would exercise the whole stack [19:33:08] as I don't want to leave this without some reason and I think based on my udnerstanding we can [19:33:14] from Zuul down to an instance being spinned and a job running on it [19:33:20] I think it is reasonably easy to do [19:33:33] https://phabricator.wikimedia.org/T158054 [19:33:34] yeah [19:33:59] good news, I am alone at home for the next 5 days. I got plenty of time to hack random stuff :] [19:34:18] so, just for clarification, the warning I saw when I did openstack server show about disk being out of space: that was labnet, correct? [19:34:24] thcipriani: yes [19:34:42] it surfaced via nova-api daemon there failing to log teh op and was returned by the api [19:34:44] ok :) [19:34:47] that was legit [19:34:55] the good news is my fullstack test I had running did in fact see that as well [19:35:02] so we would ahve caught it if that was alerting, which is cool [19:35:09] bad news is it wasn't alerting yet [19:35:11] nice [19:35:49] I feel somewhat comfortable saying the increase in activity from novaobserver and nodepool have made the rotations schedule for logs on labnet too lax [19:36:02] and nodepool kept forceably trying to add instances and used up imaginary quota [19:36:16] and it couldn't even get back to fixing the quota as that's based on fixup during an operation [19:36:24] and it had already violated the limit [19:36:33] so I raised the quota and it started rolling and in theory fixed the quota [19:36:39] am I making sense? [19:36:43] yes [19:37:09] totally fun times domino type issue [19:37:10] this jives with my understanding of what you explained that I asked a lot of dumb questions about :) [19:37:33] on phone, will catch up with the backlog in a few [19:37:53] lots of moving pieces: hard to predict something like this error [19:38:09] truly [19:38:18] how long did the issue last for? [19:38:45] also journalctl is the worst thing in live I'm convinced [19:38:48] in life [19:38:53] roughly 2 hours? A little less. [19:39:09] is that right? I just looked at the oldest job in zuul [19:39:42] I don't know how to tell [19:40:30] we need a predictable canary [19:40:36] https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen&from=now-3h&to=now [19:40:51] so at 17:31 [19:41:04] night of the living purple [19:41:14] indeed [19:41:25] I agree that was probably the turn [19:41:37] did our nodepool check fire off at all? 
[19:41:51] I haven't looked at scrollback, but it definitely should have [19:42:09] when I first jumped on there were machines in delete for quite some time. [19:43:49] thcipriani: I'm going to fix the quota back to "normal" [19:43:52] objection? [19:43:57] re [19:44:09] can you view current allocation? [19:44:29] to ensure we're back down to roughly where we should be? [19:46:58] I am pretty sure we can get OpenStack logs digested by logstash. That would surely help [19:47:01] I'm looking for a command that does it [19:52:10] thcipriani: it's kind of insane there isn't a good way to do that, turns out horizon calculates it internally from a mash of api calls [19:52:15] re: quota usage [19:52:28] and the rabbit hole does deep for getting accurate info from cli [19:52:49] that's awful :D [19:53:20] hashar: +1 to nodepool (and Jenkins, and Zuul?) logs in logstash [19:53:25] I know you have a WIP for Jenkins [19:53:26] ok, well, if my current understanding holds true, our quota should be back to normal [19:53:44] so we can try to relower and keep an eye on nodepool [19:54:10] I'm looking at horizon now to see if it's sane [19:54:39] it looks right [19:54:50] I feel alright to re-lower [19:55:10] +1 [19:55:27] or not [19:55:30] Quota limit 200 for fixed_ips must be greater than or equal to already used and reserved 206. (HTTP 400) (Request-ID: req-60e06084-eeb3-4239-9da1-8ecec2e3bbe3 [19:56:15] ok so new plan, I'm going to look at this for a minute and wait to ask andrew about more drastic reset measures [19:57:01] hrm, ok. [19:57:12] somewhere deep in the bowels of something those failed to allocate instances survive [19:58:25] FWIW, nodepool alien-list is coming up empty [19:58:51] I think this portion of things is on openstacks side [20:04:59] 06Release-Engineering-Team, 06Labs, 06Operations: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp) [20:05:03] 06Release-Engineering-Team, 06Labs, 06Operations: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034406 (10chasemp) p:05Triage>03High [20:05:31] and that quota doesnt even show up in "nova absolute-limits" bah [20:05:59] 06Release-Engineering-Team, 06Labs, 06Operations: contintcloud project thinks it is using 206 fixed-ip quota errantly - https://phabricator.wikimedia.org/T158350#3034394 (10chasemp) a:03Andrew currently nodepool is going along fine except the quota is clearly wrong. I don't yet understand why the current... [20:06:52] thcipriani: hashar https://phabricator.wikimedia.org/T158350 [20:06:57] will talk to andrew when he returns [20:08:08] chasemp: cool, thank you for your help [20:09:07] ahh [20:12:00] thcipriani: you too :) [20:17:52] thcipriani: is it possible max_age there is minutes and not seconds? [20:18:06] have ot dig up the code [20:18:18] fractions of an hour? [20:18:27] :) [20:18:34] there is a debug logger for nova.quota apparently [20:19:25] maybe it is off in the database :/ [20:19:40] it definitely is [20:19:43] but why [20:20:08] supposedly since you have set it to a max_age of 30s [20:20:17] the quotas should refresh whenever a reservation happens [20:21:06] yep [20:37:48] 10Continuous-Integration-Infrastructure: Create a controlled and ongoing CI pipeline test job that we can alert on - https://phabricator.wikimedia.org/T158054#3034474 (10hashar) Ok here the random crap idea. 
In Nodepool define a new label attached to an image and with a number of ready instance set to zero. Cr... [20:55:16] 10MediaWiki-Releasing, 10MediaWiki-Containers, 06Services, 15User-mobrovac: Ready-to-use Docker package for MediaWiki - https://phabricator.wikimedia.org/T92826#3034521 (10bd808) >>! In T92826#3034057, @Dan.mulholland wrote: > 3. deploying extensions and skins using Git (i.e. no Composer or minimal compose... [20:57:28] https://www.neowin.net/news/yahoo-is-warning-users-that-their-accounts-have-been-compromised-using-forged-cookies the third breach. [21:02:11] and the question of tonight is [21:02:36] would a job scheduled by Jenkins (not Zuul) ends up being noticed by Nodepool as demand for a label [21:07:47] 10Continuous-Integration-Infrastructure: Create a controlled and ongoing CI pipeline test job that we can alert on - https://phabricator.wikimedia.org/T158054#3034570 (10hashar) Tested. A job scheduled directly by Jenkins bypass the Zuul gearman server as expected and thus Nodepool cant find there is demand for... [21:29:22] (03PS1) 10Hashar: (WIP) Timed build from Zuul [integration/config] - 10https://gerrit.wikimedia.org/r/338179 (https://phabricator.wikimedia.org/T158054) [21:29:34] (03CR) 10Hashar: [C: 04-2] (WIP) Timed build from Zuul [integration/config] - 10https://gerrit.wikimedia.org/r/338179 (https://phabricator.wikimedia.org/T158054) (owner: 10Hashar) [21:30:15] 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Create a controlled and ongoing CI pipeline test job that we can alert on - https://phabricator.wikimedia.org/T158054#3034591 (10hashar) https://gerrit.wikimedia.org/r/338179 is an implementation of plan B. The commit message explains it all but I d... [21:30:52] (03CR) 10jerkins-bot: [V: 04-1] (WIP) Timed build from Zuul [integration/config] - 10https://gerrit.wikimedia.org/r/338179 (https://phabricator.wikimedia.org/T158054) (owner: 10Hashar) [21:41:15] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3034630 (10hashar) A few hours ago we had an outage and the queue pilled up quickly from roughly 17:30 until 19:00. The Icinga alarm were: | 18:27:10... [21:41:40] I am not made for anomaly detections :/ [21:41:54] or I fail to understand Holt-Winter forecasting [21:44:04] hashar hows everything going with releng today? [21:49:53] Zppix: nothing fancy :] [21:50:02] trying to find a nice query to detect an outage such as https://grafana.wikimedia.org/dashboard/db/zuul-gearman?from=1487264064220&to=1487279251013&panelId=10&fullscreen [21:50:05] and that it recovered [21:50:14] issue being too many red [21:50:22] recovery is we get green again and red lowers [21:52:40] hashar why not do something like red for x mins triggers icinga-wm alert to $irc [21:58:16] hashar just out of curiousity wouldnt it make sense to have jenkins stuff in [puppet] repo display in here (im talking about gerrit changes) [22:00:33] Zppix: not sure whether the grrrit spammer can filter based on files changes. But if so yeah probably would make sense to dupe puppet patches here [22:00:38] (and gerrit) [22:00:50] then we are all in #wikimedia-operations already [22:00:55] and only ops can merge the changes [22:01:01] so I am not sure we need the extra spam here [22:01:56] hashar i dont know what its capable since its now merged wih wm-bot [22:02:11] i'd assume simple regex could filter it no? 
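For the filter Zppix sketches above: the Gerrit stream event only carries project/subject/topic, so, as hashar notes at 22:18, the bot would have to fetch the change to know which files it touches. A hedged sketch against Gerrit's standard REST `files` endpoint; the patterns and the `operations/puppet` restriction are illustrative only:

```
import json
import re
import urllib.request

GERRIT = 'https://gerrit.wikimedia.org/r'
INTERESTING = re.compile(r'(jenkins|zuul|nodepool|contint)', re.IGNORECASE)

def changed_files(change_number):
    """List file paths touched by the current revision of a change."""
    url = '%s/changes/%s/revisions/current/files' % (GERRIT, change_number)
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode('utf-8')
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip it.
    return json.loads(body.split('\n', 1)[1])

def should_announce(event):
    """Decide whether a patchset-created stream event is CI-relevant."""
    change = event['change']
    if INTERESTING.search(change.get('topic') or ''):
        return True
    if change.get('project') != 'operations/puppet':
        return False
    return any(INTERESTING.search(path) for path in changed_files(change['number']))
```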
[22:02:47] if commit (or changeset's topic) == * jenkins * then x [22:02:54] something along those lines [22:11:42] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3034700 (10hashar) Maybe just doing a ratio of jobs waiting / jobs running would be sufficient. https://grafana.wikimedia.org/dashboard/db/zuul-gearma... [22:18:39] Zppix: maybe. If it knows about the patchset but I dont think the list of files is given [22:18:52] one would have to fetch the change from gerrit most probably [22:19:05] and then, I dont know whether we want the additional spam in here. Maybe thought [22:19:23] hashar it couldnt query the same api it gets its changes to post to irc and regex it to see if it mentions jenkins? [22:30:03] hashar i was snooping and i know your heavily involved with CI and jenkins' jobs and the latest run of apps-android-wikipedia-periodic-test failed due to emulator fyi (it may need updated) [22:39:53] 10Gerrit, 06Labs, 10Pywikibot-core, 10Tool-Labs: Fresh clone of pywikibot from gerrit fails with error: RPC failed; result=56, HTTP code = 200 on Toollabs (NFS) - https://phabricator.wikimedia.org/T151351#3034803 (10scfc) p:05Triage>03Low [22:41:28] 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review: Alert when Zuul/Gearman queue is stalled - https://phabricator.wikimedia.org/T70113#3034815 (10Dzahn) scheduled a downtime in Icinga for one week, to disable notifications while we are still working on it. it will start to notify agai... [22:45:04] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10hashar) [22:46:56] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034851 (10hashar) ``` $ du -h /var/lib/l10nupdate/caches/ 1.5G /var/lib/l10nupdate/caches/cache-1.29.0-wmf.2 1.5G /var/lib/l10nupdate/caches/cache-1.29.0-wmf.... [22:50:56] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034858 (10hashar) From IRC supposedly we had a cron job to garbage collect the old caches. ``` $ sudo -u l10nupdate -s crontab -l 0 2 * * * /usr/local/bin/l10nupdat... [22:51:12] AAAAAAA! directory has vanished: "/php-1.29.0-wmf.5" (in common) [22:51:24] getting spammed during scap [22:52:10] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Dzahn) also see: T130317, T133913, T119747 [22:52:28] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034871 (10bd808) [22:52:47] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034872 (10Dzahn) I ran "apt-get clean" on tin which freed another 2G or so [22:55:11] MaxSem: clean up is in progress I guess [22:55:32] MaxSem: see operations channel. 
/ on tin was almost full [22:56:00] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034874 (10Dzahn) [22:56:33] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034831 (10Reedy) I killed the 1.28 l10nupdate cache folders, and the 1.29 ones < .10 [22:57:16] 10Deployment-Systems, 06Release-Engineering-Team, 06Operations: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3034878 (10Dzahn) Makes me think how mira is doing. [22:57:55] * MaxSem just hopes this will not screw his scap up [22:58:39] MaxSem: are you scapping right now? [22:58:57] yep [22:59:07] I mean, it just ended [22:59:39] it should be fine [23:00:03] weee:) [23:00:08] RainbowSprinkles: we should lock scap clean from running when regular scap is running :) [23:00:32] that was me removing stuff while you were syncing [23:01:05] thcipriani: Probably, yes [23:01:16] It should use locks like all AbstractSyncs do [23:01:34] anyone doing anything right now? I want to scap another time just to be sure [23:01:54] Granted we should probably handle locks further down so all scripts get them [23:01:56] I'm clear [23:02:06] ^ MaxSem [23:02:19] cool, running [23:02:40] hrm, FWIW lots of other tools deploy at the same time I'm doing scap stuff for mediawiki [23:02:58] but all the abstractsync related tools should have some overall lock for sure [23:03:00] * Reedy imagines MaxSem on the Jamaican Bobsleigh Team [23:03:16] thcipriani: lock before Application.main() if the app returns should_lock() or something? [23:03:48] Probably removes some dupe code between main.py and deploy.py [23:06:22] hrm, we could lock as soon as Application::run after we load config [23:14:00] Yeah, assuming an application says "I need a lock" [23:14:20] eg: say, deploy-log, sal, wikiversions-inuse clearly don't need locks [23:14:24] Since they do read operations [23:14:50] I'd say either a method in cli.Application that can be overridden, or maybe an @annotation of sorts [23:17:34] decorator would extend a pattern we've been using for creating commands [23:21:12] I wonder if we could even make it a param to the existing cli.command() [23:21:22] needs_lock=>True [23:34:16] 10Gerrit, 10UI-Standardization, 13Patch-For-Review: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3032714 (10JGirault) Are we aiming at several patches here? The ticket description doesn't tell. Otherwise according to the patch above, the title should b... [23:38:08] 10Gerrit, 10UI-Standardization, 13Patch-For-Review: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3032714 (10demon) I wouldn't spend a ton of energy on this to be honest. Gerrit's in the process of swapping to a totally new UI (optional with the next rele... [23:42:03] 10Gerrit, 10UI-Standardization, 13Patch-For-Review: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3032714 (10Volker_E) Citing (myself) from the patch set: > In general I'd recommend not start aligning this tool with WMUI style guide, as it would go far be... [23:46:20] 10Gerrit, 10UI-Standardization, 13Patch-For-Review: Make gerrit colors align with WikimediaUI color palette - https://phabricator.wikimedia.org/T158298#3035085 (10JGirault) Agree with that is said above. 
I'm also afraid of the scope of this ticket, that's why I asked for more details in the description.
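Circling back to the scap locking exchange at 23:01 to 23:21 (well above the Gerrit-colors thread that closes this log): the shape thcipriani and RainbowSprinkles converge on, a `needs_lock` flag on the command-registration decorator with the lock taken in `run()` before `main()`, would look roughly like the sketch below. None of this is scap's real `cli` module; every name here is made up to mirror the discussion:

```
import fcntl
import os

_COMMANDS = {}

def command(name, needs_lock=False):
    """Register a command class; needs_lock marks anything that mutates the deploy."""
    def wrap(cls):
        cls.needs_lock = needs_lock
        _COMMANDS[name] = cls
        return cls
    return wrap

class Application:
    lock_file = '/tmp/scap-sketch.lock'   # made-up path for the sketch

    def run(self):
        if getattr(self, 'needs_lock', False):
            # Held for the lifetime of the process; a second locking command
            # (say, a clean running alongside a sync) blocks here instead of racing.
            fd = os.open(self.lock_file, os.O_CREAT | os.O_WRONLY, 0o664)
            fcntl.flock(fd, fcntl.LOCK_EX)
        return self.main()

    def main(self):
        raise NotImplementedError

@command('sync', needs_lock=True)
class Sync(Application):
    def main(self):
        print('syncing...')

@command('deploy-log', needs_lock=False)   # read-only commands skip the lock
class DeployLog(Application):
    def main(self):
        print('tailing logs...')

if __name__ == '__main__':
    _COMMANDS['sync']().run()
```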