[00:17:21] Yippee, build fixed!
[00:17:21] Project selenium-Flow » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #112: 09FIXED in 1 min 20 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/112/
[02:17:25] Yippee, build fixed!
[02:17:26] Project selenium-QuickSurveys » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #120: 09FIXED in 4 min 25 sec: https://integration.wikimedia.org/ci/job/selenium-QuickSurveys/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/120/
[03:25:51] PROBLEM - Puppet run on integration-slave-trusty-1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:40:35] 10MediaWiki-Releasing, 05MW-1.26-release: Consider maybe backporting https://gerrit.wikimedia.org/r/#/c/249054/ to last stable - https://phabricator.wikimedia.org/T126344#2556085 (10demon) 05Open>03Resolved a:03demon
[04:05:50] RECOVERY - Puppet run on integration-slave-trusty-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:42:40] 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2556217 (10Paladox) Oh sorry..
[06:55:44] 10Beta-Cluster-Infrastructure, 10ContentTranslation-CXserver: Move apertium to deployment-sca* hosts in Beta Cluster - https://phabricator.wikimedia.org/T142152#2556248 (10Arrbee)
[08:05:09] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 588 bytes in 0.002 second response time
[08:06:18] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 TLS Redirect - string 'Wikipedia' not found on 'http://en.m.wikipedia.beta.wmflabs.org:80/wiki/Main_Page?debug=true' - 590 bytes in 0.015 second response time
[08:53:28] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:06:05] !log removing ores-related-cherry-picked commits from deployment-puppetmaster
[09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[09:15:47] * paladox is going to Scotland, left at 8am :)
[10:42:08] 10Browser-Tests-Infrastructure, 10MediaWiki-extensions-MultimediaViewer, 06Reading-Web-Backlog, 13Patch-For-Review, and 5 others: A JSON text must at least contain two octets! (JSON::ParserError) in MultimediaViewer, Echo, Flow, RelatedArticles, MobileFront... - https://phabricator.wikimedia.org/T129483#2556663
[11:49:30] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[12:22:59] Yippee, build fixed!
[12:22:59] Project selenium-GettingStarted » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #115: 09FIXED in 58 sec: https://integration.wikimedia.org/ci/job/selenium-GettingStarted/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/115/
[12:34:37] 03Scap3, 15User-mobrovac: Sequential execution should be per-deployment, not per-phase - https://phabricator.wikimedia.org/T142990#2556883 (10mobrovac)
[12:34:39] 03Scap3 (Scap3-Adoption-Phase1), 10scap, 10Parsoid, 06Services, and 2 others: Deploy Parsoid with scap3 - https://phabricator.wikimedia.org/T120103#2556882 (10mobrovac)
[13:10:55] 07Browser-Tests, 06Release-Engineering-Team, 07Documentation: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2556983 (10Tobi_WMDE_SW)
[13:11:12] 07Browser-Tests, 06Release-Engineering-Team, 07Documentation: Document browser tests ownership (and what it means) on wiki - https://phabricator.wikimedia.org/T142409#2533979 (10Tobi_WMDE_SW)
[13:47:16] RECOVERY - SSH on deployment-redis02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[13:50:20] PROBLEM - Puppet staleness on deployment-redis02 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [43200.0]
[13:55:21] RECOVERY - Puppet staleness on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:10:59] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557175 (10chasemp)
[14:28:15] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557202 (10chasemp)
[14:29:34] 10Continuous-Integration-Infrastructure, 07Nodepool: 2016-08-10 CI incident follow-ups - https://phabricator.wikimedia.org/T142952#2557221 (10chasemp)
[14:29:36] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: OpenStack misreports number of instances per project - https://phabricator.wikimedia.org/T143018#2557217 (10chasemp) 05Open>03declined Some parts of this are confused by the adhoc nature of reporting on our end, the usage command is i...
[14:29:48] thcipriani: for your consideration https://phabricator.wikimedia.org/T143016
[14:29:55] let me know if that's too opaque as far as issues
[14:32:37] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2557226 (10chasemp) p:05Triage>03High
[14:33:55] Yippee, build fixed!
[14:33:56] Project selenium-WikiLove » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #114: 09FIXED in 1 min 55 sec: https://integration.wikimedia.org/ci/job/selenium-WikiLove/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/114/
[14:43:55] chasemp: wowza, that's a lot of text :)
[14:44:53] good overview of the issue from the nova-end though, thank you!
[14:49:48] thcipriani: ok cool, trying to put emphasis on this isn't finger pointy at all on my end, but 25% operation error rate nodepool side alone is indication enough we need to figure out wtf is going on
[14:50:17] I mean, that could be as designed but that seems weirdly unlikely
[14:51:02] indeed. The change in timeout is interesting. I wonder if nodepool has some internal timeout that is getting tripped up by that change.
[14:52:34] two timeouts thcipriani I find interesting are
[14:52:36] boot-timeout: 120
[14:52:36] launch-timeout: 900
[14:52:47] from the example docs, I think we only set boot-timeout?
[14:53:04] and also the 1s processing for build, the only thing I can figure from surfing the code for a few yesterday is the rate
[14:53:11] which seems like the loop delay for ops
[14:53:40] hmm, we have a boot-timeout of 300 and an api-timeout of 60
[14:53:58] api-timeout is mostly a red herring in that we are servicing the request
[14:54:09] I think...
[14:54:16] unless api-timeout is a ceiling for other op timeouts
[14:54:20] I'm not entirely sure
[14:55:27] another thing thcipriani I didn't note yet is I played a bit w/ the integrated reporting from nodepool
[14:55:44] I think I showed you before but it's stuff like
[14:55:44] https://graphite.wikimedia.org/render/?width=1010&height=471&_salt=1471359297.795&target=nodepool.launch.error.unknown.count&from=-3d
[14:56:59] and if you look at errors by provider
[14:57:05] and compare across instance creation errors
[14:57:05] https://graphite.wikimedia.org/render/?width=1010&height=471&_salt=1471359368.044&target=cactiStyle(nodepool.launch.error.unknown.count)&target=cactiStyle(nodepool.launch.provider.wmflabs-eqiad.error.unknown.count)&from=-3d
[14:57:09] pretty much a match
[14:59:44] these are the number of instances that failed to launch with an unknown error? Is it weird that it never seems to go below 5?
[15:00:23] we also have a delete-delay setting that is interesting in that it seems hashar wrote it: https://review.openstack.org/gitweb?p=openstack-infra/nodepool.git;a=commitdiff;h=7927dd189a31b36d3db9da4a6846e449affc0931
[15:01:17] yeah that's afaiu saying "delete immediately post test"
[15:01:42] but it's not impossible that the nature of delay in spin up and tear down w/ this upstream is less aggressive than our own
[15:02:40] thcipriani: it does dip below 5 but sure hovers there
[15:02:40] https://graphite.wikimedia.org/render/?width=1010&height=471&_salt=1471359297.795&target=nodepool.launch.error.unknown.count&from=-7d&from=-6d&lineMode=connected
[15:03:13] got to SWAT, biab
[15:03:16] later
[15:03:20] just food for thought
[15:04:07] yeah, I have a feeling it's probably going to take some source code spelunking to figure out what nodepool thinks it's doing in this instance.
[15:05:24] yup
[15:05:41] the code is fairly well written tho so that's a plus
[15:17:01] (03PS1) 10Aude: Update Wikidata branch to wmf/1.28.0-wmf.15 [tools/release] - 10https://gerrit.wikimedia.org/r/305028
[15:35:34] (03CR) 10Aude: [C: 032] Update Wikidata branch to wmf/1.28.0-wmf.15 [tools/release] - 10https://gerrit.wikimedia.org/r/305028 (owner: 10Aude)
[15:36:06] (03Merged) 10jenkins-bot: Update Wikidata branch to wmf/1.28.0-wmf.15 [tools/release] - 10https://gerrit.wikimedia.org/r/305028 (owner: 10Aude)
[15:43:16] 10Continuous-Integration-Config, 10VisualEditor: VisualEditor-MediaWiki does not have the right CI tasks in REL1_27 - https://phabricator.wikimedia.org/T143117#2557419 (10Jdforrester-WMF)
[16:15:02] 03Scap3, 15User-mobrovac: Sequential execution should be per-deployment, not per-phase - https://phabricator.wikimedia.org/T142990#2557560 (10mobrovac) >>! In T142990#2554038, @dduvall wrote: > We could make this easier by supporting a per-group parameter that internally splits the group by a given size, resul...
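For reference, the timeout and rate settings chasemp and thcipriani compare in the 14:52-15:05 discussion above are provider-level options in nodepool's /etc/nodepool/nodepool.yaml. A minimal sketch of how they sit together, assuming the legacy OpenStack-infra nodepool config layout: only boot-timeout: 300, api-timeout: 60, and a roughly 1-second rate are confirmed in the conversation, while launch-timeout: 900 comes from the upstream example docs and may not be set locally.

```
# Sketch only; provider block in the legacy nodepool.yaml format.
# Values not confirmed in the conversation above are marked as assumptions.
providers:
  - name: wmflabs-eqiad      # provider name as seen in the graphite metric paths above
    boot-timeout: 300        # seconds to wait for the provider to report the instance as up
    api-timeout: 60          # timeout for individual API calls to the provider
    launch-timeout: 900      # assumption: upstream example value for the overall launch budget
    rate: 1.0                # assumption: delay in seconds between provider operations ("loop delay for ops")
```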
[16:21:47] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:22:10] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[17:47:22] 06Release-Engineering-Team, 15User-greg: Determine timing of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T137720#2557877 (10greg)
[17:47:36] 06Release-Engineering-Team, 15User-greg, 15User-zeljkofilipin: Determine location of 2016 RelEng team offsite - https://phabricator.wikimedia.org/T137721#2557894 (10greg)
[17:51:46] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[17:57:19] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120)
[18:12:42] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:13:00] 10Beta-Cluster-Infrastructure, 06Operations: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2557968 (10greg) So.... remove `under_NDA`?
[18:14:14] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T139217#2557988 (10greg)
[18:14:16] 10Beta-Cluster-Infrastructure, 06Commons, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-extensions-CentralAuth, 06Reading-Web-Backlog: Unable to log in on https://commons.m.wikimedia.beta.wmflabs.org/wiki/Special:UserLogin - https://phabricator.wikimedia.org/T142015#2557986 (10greg) 05Open>...
[18:15:31] 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible, 07I18n: On Beta Cluster, MediaWiki namespace override is inconsistently applied - https://phabricator.wikimedia.org/T142863#2549208 (10greg) >>! In T142863#2549822, @Mattflaschen-WMF wrote: > Wasn't able to reproduce it in shell, so I restarted hhvm...
[18:21:04] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:25:59] PROBLEM - Puppet run on deployment-eventlogging03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[18:43:37] (03CR) 10EBernhardson: "looks good! some minor comments" (034 comments) [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/301335 (owner: 10Lethexie)
[18:43:56] Lol /me in Scotland currently going past Edinburgh
[18:51:48] what's funny?
[18:51:52] oh he quit
[19:11:52] 06Release-Engineering-Team, 06Developer-Relations, 06WMF-Legal, 07RfC: Remove @author lines from code - https://phabricator.wikimedia.org/T139301#2558245 (10ZhouZ) The GPL requires some sort of copyright notice for the source code. But if we want to, in general it should not be a problem to put the credit...
[19:21:35] 10Continuous-Integration-Infrastructure, 06Labs, 10Labs-Infrastructure: Delete ci-trusty-wikimedia-278848 instance in contintcloud project - https://phabricator.wikimedia.org/T143058#2558277 (10chasemp) 05Open>03Resolved a:03chasemp I wish I knew why that was the case but I deleted from the CLI and I s...
[19:22:42] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[19:26:04] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:57] RECOVERY - Puppet run on deployment-eventlogging03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:16:55] 10Continuous-Integration-Config, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM: wikimedia/fundraising/civicrm-buildkit repo needs to V+2 itself - https://phabricator.wikimedia.org/T142901#2550186 (10Eileenmcnaughton) Note all our changes are now in the upstream but it's hard to merge back in
[20:28:43] 10Deployment-Systems, 03Scap3: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762#2558580 (10thcipriani) Hi @fgiunchedi I have a bugfix release for scap. Tagged `debian/3.2.3-1` and pushed up to the repo. This one should fix {T142792} and {T142364} Thanks for all your help!
[20:37:50] 06Release-Engineering-Team, 10LDAP-Access-Requests, 06Operations, 10Ops-Access-Requests, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2558632 (10greg) > @RobH moved this task from Backlog to In Discussion on the Ops-Acce...
[20:39:09] 06Release-Engineering-Team, 10LDAP-Access-Requests, 06Operations, 10Ops-Access-Requests, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2558655 (10RobH) Correct, this seems to be under discussion overall, not within the te...
[20:50:45] Hey, is CI running out of trusty machines again?
[20:53:43] Huh. Finally.
[21:00:39] Can we do something about Phabricator processing unmerged commits? It seems rather unproductive to have it mention every iteration of a patch - https://phabricator.wikimedia.org/T141996 for example.
[21:00:55] This seems a regression from before we replicated everything from Gerrit.
[21:25:22] 10Beta-Cluster-Infrastructure, 06Commons, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-extensions-CentralAuth, 06Reading-Web-Backlog: Unable to log in on https://commons.m.wikimedia.beta.wmflabs.org/wiki/Special:UserLogin - https://phabricator.wikimedia.org/T142015#2558949 (10Tgr) Yeah. I te...
[21:43:09] Krinkle: I'll look into it
[21:44:18] Krinkle: relevant task: https://phabricator.wikimedia.org/T89940
[21:45:22] * greg-g is filing a task
[21:51:11] o/
[21:51:24] Ah, I see. That makes sense.
[21:53:08] Krinkle: still worth a task? :)
[21:53:15] Yeah, I think so.
[21:53:21] Assuming we want to change the behaviour.
[21:54:26] it'd be nice if it could just notify once, but, it's already a hack since the purpose of it is to associate differentials, which is done intelligently (doesn't spam like this does)
[21:55:14] Krinkle: https://phabricator.wikimedia.org/T143162#2559092
[21:55:55] * awjr waves
[21:55:59] ahoy hoy
[21:56:03] :D
[21:56:25] anyone know if there is a way in maniphest to query for all child tasks of a specific parent task (both open and closed)?
[21:56:40] i would like to bulk-edit a bunch of child tasks
[21:56:51] ah.... hrmmm, not off hand
[21:56:54] but i've been banging my head on the keyboard trying to figure this one out
[21:57:14] at this point, it would have been way faster to manually go through the tasks in the task graph and edit one-by-one
[21:57:36] but i refuse to believe this is impossible.
[21:57:53] relevant: https://xkcd.com/1205/
[21:58:07] lolyes
[21:58:10] :)
[21:58:17] but it's the principle of the thing, dammit
[21:58:31] and btw how are you greg-g? it's been a while :D
[21:58:54] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971#2559122 (10Jdforrester-WMF)
[21:59:08] is zuul sad? I don't see any queue data on https://integration.wikimedia.org/zuul/ and have a commit that claims to have entered gate-and-submit 13 minutes ago but has not merged
[21:59:38] bd808: 13 minutes is unsurprising right now. Most of my patches are taking ~20 minutes.
[22:00:01] bd808: Hurting for boxes, I think.
[22:00:26] I just noticed that Privacy Badger has marked mw.o as a tracker, explaining my lack of vision on the queues
[22:00:29] https://integration.wikimedia.org/ci/ shows running jobs on nodepool
[22:00:52] not many/enough, but, yeah
[22:01:13] (see also: bumping nodepool quota, but see also: nodepool screwing with openstack)
[22:01:37] I have empathy for both sides of that one
[22:01:53] I don't think anyone in the current Labs team understood what nodepool was going to do
[22:05:16] 06Release-Engineering-Team (Deployment-Blockers), 10MobileFrontend, 06Reading-Web-Backlog, 03Reading-Web-Sprint-79-Uh-oh, and 2 others: The "Disclaimers" link in Special:MobileMenu is misaligned - https://phabricator.wikimedia.org/T143066#2559167 (10Jdlrobson) p:05High>03Unbreak! It looks like the cont...
[22:05:30] 06Release-Engineering-Team (Deployment-Blockers), 10MobileFrontend, 06Reading-Web-Backlog, 03Reading-Web-Sprint-79-Uh-oh, and 2 others: MediaWiki:Common.css loaded on mobile instead of MediaWiki:Mobile.css - https://phabricator.wikimedia.org/T143066#2559175 (10Jdlrobson)
[22:08:53] awjr: to answer your question: I'm alright, but the usual fire fighting mentality, sadly
[22:09:55] * awjr hands greg-g some gasoline
[22:10:00] er
[22:10:17] * awjr hands greg-g some water
[22:11:46] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971#2559202 (10greg)
[22:11:49] 06Release-Engineering-Team (Deployment-Blockers), 10MobileFrontend, 06Reading-Web-Backlog, 03Reading-Web-Sprint-79-Uh-oh, and 2 others: MediaWiki:Common.css loaded on mobile instead of MediaWiki:Mobile.css - https://phabricator.wikimedia.org/T143066#2559203 (10greg)
[22:15:14] awjr: I'll take the first as well, there are indeed things which we should burn as well
[22:15:30] also, "yay", two new deploy blockers: https://phabricator.wikimedia.org/T140971
[22:20:05] greg-g: :D
[22:20:41] to burning things, that is, not so much to the deploy blockers.
[22:23:40] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971#2559248 (10greg)
[22:23:44] now up to 3
[22:37:57] !log restarting nodepool, bumping max_servers to match up with what openstack seems willing to allocate (6)
[22:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:38:47] s/bumping/dropping/ ;)
[22:39:55] potato tomato
[22:40:07] :) :)
[23:00:55] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971#2559381 (10Yurik)
[23:01:14] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559382 (10thcipriani) I dropped the `max-servers` in `/etc/nodepool/nodepool.yaml` to 6 as that seemed to be the max number of allocated ins...
[23:08:39] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559429 (10thcipriani) hrm. Maybe I stopped seeing 403s since demand was lower. Still working with 6 instances, just got: ``` Forbidden: Quo...
[23:10:41] !log max_servers at 6, seeing 6 allocated instances, still seeing 403 already used 10 of 10 instances :((
[23:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[23:11:51] wth
[23:12:05] 06Release-Engineering-Team (Deployment-Blockers), 05Release: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971#2559440 (10Jdforrester-WMF)
[23:12:50] 10Browser-Tests-Infrastructure, 10MediaWiki-extensions-MultimediaViewer, 06Reading-Web-Backlog, 13Patch-For-Review, and 4 others: A JSON text must at least contain two octets! (JSON::ParserError) in MultimediaViewer, Echo, Flow, RelatedArticles, MobileFront... - https://phabricator.wikimedia.org/T129483#2559442
[23:13:53] 10Continuous-Integration-Infrastructure, 06Labs, 07Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2559446 (10thcipriani) Messages like this one: ``` DEBUG nodepool.NodePool: Deleting node id: 261807 which has been in building state for 0....
[23:50:48] (03CR) 10Gergő Tisza: "On reflection not very useful for CI debugging as it outputs passwords and session cookies, so this should never be enabled in Jenkins. S" [selenium] - 10https://gerrit.wikimedia.org/r/304332 (https://phabricator.wikimedia.org/T142600) (owner: 10Gergő Tisza)
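The max-servers change logged at 22:37 and described in thcipriani's T143016 comments above is a knob in the same provider block of /etc/nodepool/nodepool.yaml as the timeouts sketched earlier. A sketch of that edit under the same assumed layout: only the key name, the file path, and the new value of 6 are taken from the log; the surrounding structure is assumed.

```
# Sketch of the change from the !log entry above; not the literal live config.
providers:
  - name: wmflabs-eqiad
    max-servers: 6   # dropped to match the ~6 instances Nova actually allocates,
                     # even though the project quota still reports "10 of 10 instances used"
```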