[02:09:01] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<50.00%) [02:25:39] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3790664 (10Legoktm) I think there's probably other ways we can help Paladox contribute in this area that don't require the nda access - is this just about wanting to vi... [02:43:39] Project beta-scap-eqiad build #183878: 04FAILURE in 0.36 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/183878/ [02:56:57] Yippee, build fixed! [02:56:57] Project beta-scap-eqiad build #183879: 09FIXED in 3 min 16 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/183879/ [03:47:22] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791246 (10demon) >>! In T181446#3791114, @Legoktm wrote: > I think there's probably other ways we can help Paladox contribute in this area that don't require the nda a... [03:59:22] Project selenium-MultimediaViewer » firefox,mediawiki,Linux,BrowserTests build #591: 04FAILURE in 3 min 21 sec: https://integration.wikimedia.org/ci/job/selenium-MultimediaViewer/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=mediawiki,PLATFORM=Linux,label=BrowserTests/591/ [04:10:42] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791269 (10Dzahn) Confirmed. This was about Gerrit logs. If there was a way to request "logstash but just Gerrit" then that would have been the request. Paladox is the... 
[06:36:47] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791403 (10greg) 05Open>03declined Thank you, @Paladox, for all of your help with this effort (and more) with Gerrit. Unfortunately, per @faidon I'm declining this... [06:44:02] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK [06:46:30] Yippee, build fixed! [06:46:30] Project selenium-Wikibase » chrome,beta,Linux,BrowserTests build #558: 09FIXED in 2 hr 6 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/558/ [07:28:01] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3791417 (10Legoktm) Do we have alerts for Gerrit exceptions? If we could set up IRC alerts (or something else), paladox could follow those and have someone with logstas... 
[07:35:01] PROBLEM - Free space - all mounts on integration-slave-jessie-1001 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1001.diskspace._mnt.byte_percentfree (No valid datapoints found)integration.integration-slave-jessie-1001.diskspace._srv.byte_percentfree (<10.00%) [08:14:59] PROBLEM - Free space - all mounts on integration-slave-jessie-1001 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1001.diskspace._mnt.byte_percentfree (No valid datapoints found)integration.integration-slave-jessie-1001.diskspace._srv.byte_percentfree (<50.00%) [08:33:17] 10Continuous-Integration-Infrastructure (shipyard), 10monitoring, 10Graphite: Purge labs graphite metrics of Docker ephemeral partitions - https://phabricator.wikimedia.org/T181476#3791555 (10hashar) [08:33:33] 10Continuous-Integration-Infrastructure (shipyard), 10monitoring, 10Graphite: Purge labs graphite metrics of Docker ephemeral partitions - https://phabricator.wikimedia.org/T181476#3791569 (10hashar) [08:33:38] 10Continuous-Integration-Infrastructure (shipyard), 10Cloud-Services, 10Operations, 10monitoring, and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3645705 (10hashar) [08:33:59] 10Continuous-Integration-Infrastructure (shipyard), 10monitoring, 10Graphite: Purge labs graphite metrics of Docker ephemeral partitions - https://phabricator.wikimedia.org/T181476#3791555 (10hashar) That first need Diamond to stop collecting the ephemeral mounts which is T177052. [08:37:19] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Operations, 10Release Pipeline, and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3791577 (10hashar) >>! In T178454#3789255, @Dzahn wrote:... 
[09:10:14] PROBLEM - Free space - all mounts on integration-slave-docker-1001 is CRITICAL: CRITICAL: integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_e6e01440f04a145c09affed5a3ca6f3b53daaab53d038a0d9a25a2321ad2e83e.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_ae0c0365a418e6d09fafa87f9895fd4e627f98827599ba01651a8d2bb7 [09:10:14] e (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_542702f60914ef9f11b4fff9801fb50b98babed3d88dab01384772c8a8f1f043.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_382de2b48f7a102ecf17cbd12d459f88057d36dba96cedb922040e35e9b82d2e.byte_percentfree (No valid datapoints found) integrat [09:10:14] docker-1001.diskspace._var_lib_docker_devicemapper_mnt_244a2d354369e401b6c66d4bd544048c3840d932cf778fdf984537361725f816.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_d8fc462aeb76074a5f89517c31b54a82a6b9bee7edadd84f909f7bf192acd160.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_ [09:10:14] a7f3437c5df7fe9022aa85e5b8d3f8456090e82a1199b2b29dec9635842.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_1c0fb41f7a2181b123e5946fba45e85cdfef2556fc8597cb15f65675a1b19486.byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_eed8308ecda3c515fc3eb746da906634666ed3ec1b4 [09:10:14] byte_percentfree (No valid datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_31efaafd7825c53306a17741dcce1df6a28a339aadc13d4aa228986a0325c8a7.byte_percentfree (No valid 
datapoints found) integration.integration-slave-docker-1001.diskspace._var_lib_docker_devicemapper_mnt_734cb090a18ec3f54bb70110faa369e4073fe6ad423521c364e403698f3ec8b1.byte_percentfree (No valid datapoints [09:13:56] (03PS1) 10Hashar: Get rid of class in test_zuul_coverage [integration/config] - 10https://gerrit.wikimedia.org/r/393731 [09:15:05] (03CR) 10jerkins-bot: [V: 04-1] Get rid of class in test_zuul_coverage [integration/config] - 10https://gerrit.wikimedia.org/r/393731 (owner: 10Hashar) [09:16:21] (03PS2) 10Hashar: Get rid of class in test_zuul_coverage [integration/config] - 10https://gerrit.wikimedia.org/r/393731 [09:18:02] (03CR) 10Hashar: [C: 032] Get rid of class in test_zuul_coverage [integration/config] - 10https://gerrit.wikimedia.org/r/393731 (owner: 10Hashar) [09:19:24] (03Merged) 10jenkins-bot: Get rid of class in test_zuul_coverage [integration/config] - 10https://gerrit.wikimedia.org/r/393731 (owner: 10Hashar) [10:50:01] PROBLEM - Free space - all mounts on deployment-videoscaler01 is CRITICAL: CRITICAL: deployment-prep.deployment-videoscaler01.diskspace.root.byte_percentfree (<11.11%) [11:00:01] RECOVERY - Free space - all mounts on deployment-videoscaler01 is OK: OK: All targets OK [11:04:30] (03PS1) 10Hashar: Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 [11:16:59] (03CR) 10Zfilipin: [C: 04-1] "Note: mediawiki-core-qunit-selenium-jessie runs both qunit and selenium tests." 
(031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/393642 (https://phabricator.wikimedia.org/T181429) (owner: 10Jdlrobson) [11:53:07] (03PS2) 10Hashar: Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 [11:54:22] (03CR) 10jerkins-bot: [V: 04-1] Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 (owner: 10Hashar) [11:55:08] (03PS3) 10Hashar: Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 [12:03:03] 10Gerrit, 10Upstream: Allow searching for 'state:active', 'state:read_only', 'state:hidden' via web interface - https://phabricator.wikimedia.org/T180297#3792141 (10Paladox) Upstream added support for this today, see https://gerrit-review.googlesource.com/#/c/gerrit/+/144951/ . Though it dosen't add it to the... [12:18:29] (03PS4) 10Hashar: Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 [12:25:37] 10Gerrit, 10Upstream: Allow searching for 'state:active', 'state:read_only', 'state:hidden' via web interface - https://phabricator.wikimedia.org/T180297#3753179 (10hashar) Meanwhile you can ask the REST API to give you all the projects with description and state with: curl 'https://gerrit.wikimedia.org/r... 
[12:25:59] (03CR) 10Hashar: [C: 032] "Lets give it a try" [integration/config] - 10https://gerrit.wikimedia.org/r/393744 (owner: 10Hashar) [12:27:03] (03Merged) 10jenkins-bot: Zuul coverage convert to multiple tests [integration/config] - 10https://gerrit.wikimedia.org/r/393744 (owner: 10Hashar) [13:22:30] (03PS1) 10Hashar: Refactor test_all_extensions_have_gate_and_submit [integration/config] - 10https://gerrit.wikimedia.org/r/393759 [13:23:11] (03CR) 10Hashar: [C: 032] Refactor test_all_extensions_have_gate_and_submit [integration/config] - 10https://gerrit.wikimedia.org/r/393759 (owner: 10Hashar) [13:24:15] (03Merged) 10jenkins-bot: Refactor test_all_extensions_have_gate_and_submit [integration/config] - 10https://gerrit.wikimedia.org/r/393759 (owner: 10Hashar) [13:59:27] PROBLEM - Puppet errors on deployment-tin is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:21:24] 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban), 10Patch-For-Review: When jenkins kills a build due to max execution time the docker containers stay running - https://phabricator.wikimedia.org/T176747#3792536 (10hashar) [14:39:25] RECOVERY - Puppet errors on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0] [15:15:01] PROBLEM - Puppet errors on deployment-redis06 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [15:16:35] 10Release-Engineering-Team (Kanban), 10Scap (Tech Debt Sprint FY201718-Q2): Scap failing to rewrite submodule urls in beta - https://phabricator.wikimedia.org/T179013#3792674 (10awight) 05Open>03Resolved Thanks, confirmed that the our submodule URLs are all pointing to tin.wmflabs now! 
[15:25:05] 10Release-Engineering-Team (Watching / External), 10Epic, 10MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017): Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733#3792679 (10jcrespo) If someone has time, please give an update to https://meta.wikimedia.org/wiki/Community_Tech/Edit_summa... [15:27:44] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:28:16] Any idea why Jenkins didn't run for any of the tasks in this search? https://gerrit.wikimedia.org/r/#/q/topic:bug/T171797-ParserOutput-stateless-transforms+OR+topic:bug/T160563-deduplicateStyles [15:31:05] 10Release-Engineering-Team (Watching / External), 10Epic, 10MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017): Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733#3792704 (10Anomie) I'll update it. [15:32:20] hashar ^^ [15:32:25] hasharAway ^^ [15:32:51] im guessing this is because we use docker now? and the tests were not updated to docker for wikibase? [15:33:42] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:39:07] hmm never mind. it is working for amir. but dosent seem to work for anomi.e changes. [15:43:36] anomie i think it may be the way you did the depends-on line. (not really sure) but it works for everyone else. So it must be that. [15:43:50] depends-on has to be above the change-id [15:44:49] paladox: That would be new. And it's happening even for changes in there with no Depends-On at all. [15:45:16] anomie are the changes that doint have Depnds-On that depend on changes that do? 
[15:46:26] hmm i see https://gerrit.wikimedia.org/r/#/c/393260/2 [15:47:00] anomie it may be because of the long topic [15:50:12] 10Release-Engineering-Team (Watching / External), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3792785 (10awight) Confirmed... [15:50:21] 10Release-Engineering-Team (Watching / External), 10Operations, 10Release Pipeline: Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3792786 (10thcipriani) 05Open>03Resolved [15:52:23] 10Release-Engineering-Team (Watching / External), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3792796 (10awight) [15:56:54] 10Release-Engineering-Team (Watching / External), 10Operations, 10Patch-For-Review, 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3792806 (10awight) [16:06:07] (03CR) 10Jdlrobson: "Yes they are failing. That's why I want to make this visible. The failing qunit tests will be fixed soon but id like to get this merged fi" [integration/config] - 10https://gerrit.wikimedia.org/r/393642 (https://phabricator.wikimedia.org/T181429) (owner: 10Jdlrobson) [16:12:45] paladox: Your "long topic" theory seems wrong too, tests on https://gerrit.wikimedia.org/r/#/c/393790/ seem to be running. [16:13:06] hmm, i wonder what it could be then. [16:31:41] no_justification hi, i am testing gelf upstream in gerrit :). I found udp does not really work. But the logstash gelf plugin dosent support tcp, but there's a pull. which allowed me to use tcp. It included the hostname too. 
[16:32:41] 10Continuous-Integration-Config, 10Release-Engineering-Team (Kanban), 10Discovery, 10Wikimedia-Portals, and 2 others: Create a Jenkins Job that builds the portal deployment artifacts in CI - https://phabricator.wikimedia.org/T179694#3792995 (10debt) Hi @hashar - did the email get setup? I don't think I rec... [16:36:34] 10Continuous-Integration-Config, 10Release-Engineering-Team (Kanban), 10Discovery, 10Wikimedia-Portals, and 2 others: Create a Jenkins Job that builds the portal deployment artifacts in CI - https://phabricator.wikimedia.org/T179694#3793019 (10debt) >>! In T179694#3792995, @debt wrote: > Hi @hashar - did t... [16:54:22] 10Continuous-Integration-Config, 10Discovery, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: CI tests on wikimedia/portals repo: cache node_modules to save time - https://phabricator.wikimedia.org/T152386#3793135 (10debt) [16:55:54] 10Continuous-Integration-Config, 10Release-Engineering-Team (Kanban), 10Discovery, 10Wikimedia-Portals, and 2 others: Create a Jenkins Job that builds the portal deployment artifacts in CI - https://phabricator.wikimedia.org/T179694#3793144 (10debt) 05Open>03Resolved [16:56:21] 10Release-Engineering-Team (Kanban), 10Discovery-Portal-Sprint: Create a dedicated deployment window for portal deployments - https://phabricator.wikimedia.org/T180401#3793147 (10debt) 05Open>03Resolved a:03debt [17:02:33] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3776331 (10Halfak) https://wikitech.wikimedia.org/wiki/Incident_document... 
[17:04:02] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793193 (10awight) [17:05:08] really, phpcs let "$headers = $this->getCustomHeaders( $params ['headers'] );" be allowed? (space before '['). I've been getting phpcs complaining left and right about my commits, like an extra step always there. [17:06:43] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793195 (10greg) @halfak: I might be wrong, but it looks like not all of... [17:09:31] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793201 (10awight) @greg Good point—I'd like to keep this parent task ac... [17:11:32] 10Release-Engineering-Team (Watching / External), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3793237 (10awight) [17:21:23] haha AaronSchulz that is evil [17:21:48] 10Scap, 10Scoring-platform-team (Current): Deployment fails with something gitty - https://phabricator.wikimedia.org/T179987#3793287 (10awight) [17:23:35] no_justification: could you ping me just before you cut the branches? :) [17:30:04] AaronSchulz: file a bug plz [17:31:30] !log Remove beta cluster customizations for ORES [17:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:33:13] Sorry, trial-and-erroring stashbot. 
[17:33:21] !log beta Remove beta cluster customizations for ORES [17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:37:07] addshore: probably start in about 20 minutes [17:37:28] no_justification: okay! [17:37:33] *makes another wikidata build* [17:40:09] woo! [17:41:28] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793388 (10awight) [17:41:32] 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: ORES beta cluster config should be as close to production as possible - https://phabricator.wikimedia.org/T181187#3782311 (10awight) 05Open>03Resolved [17:47:33] addshore: So we got a little off track with our killing of that build with all of the wmf.8 shenanigans. Think we can resume this week? [17:48:23] So, I was planning on finishing it up once .10 was rolled out everywhere [17:48:37] i scheduled some slots for monday and tuesday next week to do everything that remains [17:49:05] I want to wait for .10 everywhere as one i start doing the next bits we will not be able to rollback to .7 [17:49:19] as the config will be incompatible [17:50:21] PROBLEM - Puppet errors on integration-slave-jessie-1002 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:50:46] PROBLEM - Puppet errors on integration-slave-jessie-1003 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:52:13] addshore: Ok great! Lemme know when you're done with your build and I'll trigger the wmf.10 build [17:52:32] no_justification: build done and merged! [17:52:39] branch branch branch ! 
[17:56:42] no_justification: I could probably keep the config around in case we do need to rollback to .7, but I think thats worry a bit too much [17:58:33] I feel like either it'll work or it won't :) [17:59:00] oh, it will :) [17:59:10] And then there will be cake [18:00:03] nom nom nom [18:03:12] 10Gerrit, 10Operations, 10Ops-Access-Requests: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3793536 (10demon) >>! In T181446#3791417, @Legoktm wrote: > Do we have alerts for Gerrit exceptions? If we could set up IRC alerts (or something else), paladox could fo... [18:06:11] PROBLEM - Puppet errors on integration-slave-jessie-1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:06:46] (03PS2) 10Jdlrobson: Run QUnit tests on Minerva skin non-experimental [integration/config] - 10https://gerrit.wikimedia.org/r/393642 (https://phabricator.wikimedia.org/T181429) [18:08:27] (03PS3) 10Jdlrobson: Run QUnit tests on Minerva skin non-experimental [integration/config] - 10https://gerrit.wikimedia.org/r/393642 (https://phabricator.wikimedia.org/T181429) [18:15:29] PROBLEM - Puppet errors on deployment-ms-be03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:21:26] PROBLEM - Puppet errors on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:39:22] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:50:32] thanks no_justification :) [18:51:41] Easy merge :) [18:52:01] :) [18:54:54] PROBLEM - Puppet errors on deployment-ms-be04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [19:05:21] 10Release-Engineering-Team, 10ORES, 10Operations, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:05:56] heyas rel-eng, do i need to follow the project creation guidelines and make a task when im making the project myself and its just another ops-datacenter site project? 
[19:06:10] (im going to assume i do but also note on task ill also make it) [19:06:59] 10Release-Engineering-Team, 10ORES, 10Operations, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:08:21] (meh, nm instructions say i can just create should have read before asking) [19:09:51] 10Release-Engineering-Team, 10ORES, 10Operations, 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793924 (10awight) This could be another submodule rewriting problem. I see that .gitmodules has been modified on the ta... [19:14:01] robh: :) [19:18:30] bleh. global heralds, i still have to login as admin to make it properly identical to the other sub-projects [19:20:24] i recall there being a command to give a url string to login... [19:21:25] Amir1: I believe that’s no_justification’s diff to wikiversions.json? [19:26:56] and my user had the rights i didnt realize it had. [19:27:03] both pleasing and mildly frightening =] [19:43:46] (03CR) 10Jdlrobson: "> mediawiki-core-qunit-selenium-jessie runs both qunit and selenium tests." [integration/config] - 10https://gerrit.wikimedia.org/r/393642 (https://phabricator.wikimedia.org/T181429) (owner: 10Jdlrobson) [19:53:48] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] [20:21:53] greg-g: Mind if we grab a hotfix window after the train is deployed? 
[20:26:38] Project selenium-Wikibase-chrome » chrome,beta,Linux,DebianJessie && contintLabsSlave build #26: 04FAILURE in 39 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase-chrome/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=DebianJessie%20&&%20contintLabsSlave/26/ [20:31:54] greg-g: testwiki wmf.10 seems very slow - loading pages, any action on a page. I checked my connection-it's fast. It's just me or there is something? [20:32:42] greg-g: to the point that I got a timeout displayed on a page saving an edit [20:33:24] wmf.10 is slow cuz it's only on testwiki [20:33:29] So caches are pretty cold [20:33:38] Once I roll to rest of group0 it'll speed up [20:34:01] (basically, each apache needs to build its wmf.10 bytecache) [20:37:50] PROBLEM - Free space - all mounts on integration-slave-jessie-1003 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1003.diskspace._srv.byte_percentfree (<11.11%) [20:38:52] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0] [20:39:47] greg-g: nvm about the hotfix, I’m just going to lay on a cold floor instead :D [20:44:12] bd808: https://tools.wmflabs.org/versions/ group2 doesn't know what version it's on :p [20:44:45] yikes! [20:45:20] * bd808 looks to see how that is supposed to work [20:45:34] awight: eek, sorry, was eating lunch, everything ok? [20:45:36] I added all of closed.dblist to group0.dblist. Might have confused it [20:46:23] greg-g: We noticed a stupid bug that causes an ORES stampede when the service goes down, but all have PTSD from the… service going down repeatedly today. So I’ll pass on the macho deployment today <3 [20:46:38] no_justification: hmmm... yeah something has gone nuts [20:46:55] greg-g: Might do it in tomorrow’s window, or even better just let it go out with the train on Thursday. 
[20:47:12] +1 for laying on the cold floor [20:47:17] lolol [20:47:22] window fail [20:47:31] Doesn't sound fun :) [20:47:34] We were comparing ways to play dead for the afternoon. [20:47:50] no_justification: was that because of hardcoded aawiki things? [20:48:01] Oooh, good point [20:48:52] Wonder if aawiki shouldn't be group0 [20:49:05] stupid aawiki [20:49:54] Unrelated, why is aawiki in wikipedia-english.dblist? [20:51:19] my home grown logic for handling the %% lines in the dblist files must be messed up I guess [20:51:21] rofl [20:51:29] in the versions tool [20:51:42] its putting all the things into group0 + group1 [20:55:51] Removing from aawiki, just to be safe. [20:56:04] Removing aawiki from group0 [20:56:05] That is [21:03:44] ah... its a funny bug [21:04:00] I create the group2 list with an array_diff [21:04:12] and then I grab $group2[0] [21:04:30] with aawiki in group0, $group2[0] ends up being unset [21:04:51] because array_diff() doesn't renumber the array slots [21:05:03] sneaky [21:13:11] no_justification: https://tools.wmflabs.org/versions/ is all better. The fix was https://phabricator.wikimedia.org/R1958:99b7725f1a3c2fc69383c84eec7f6acdfc2c9009 [21:13:18] and it was tripped up by aawiki :) [21:16:00] bd808: Coolio. I also moved aawiki back to group2 just out of precaution too. Other weird things might rely on it not being group0 [21:17:38] I seem to remember something getting very freaked out by aawiki at some point, but I hope that was scap1 code [22:00:01] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<30.00%) [22:08:43] legoktm / etc.: Don't suppose you feel like making a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain? https://gerrit.wikimedia.org/r/#/c/393918/ is OK but I'm not planning to make 100 manual commits like that. [22:11:01] Jees :/ [22:11:35] That's a maintenance nightmare [22:11:51] Yeah. 
And all for one file (the jsduck custom tags config file). [22:12:24] Ideally we'd bin jsduck for something not based on Ruby, as we don't use that. :-) [22:13:16] Migration to node stuff is underway iirc? [22:13:27] It is? [22:13:45] Or planned? [22:13:46] Meh [22:13:52] https://phabricator.wikimedia.org/T138401 [22:14:51] https://phabricator.wikimedia.org/T139740 [22:21:48] no_justification: thanks. git br -a now fits on my screen without scrolling! [22:22:12] Heh, I got yelled at for overzealously killing branches! [22:23:54] Reedy: Yeah, that's selenium testing, not jsduck. [22:38:42] I somehow find it hysterical that ruby is used to produce js documentation [22:42:58] heh] [22:58:21] 10Continuous-Integration-Infrastructure: Jenkins is mysteriously not running for some changes - https://phabricator.wikimedia.org/T181574#3794756 (10Anomie) [22:59:01] greg-g: FYI I’m shooting for a hotfix window again. [23:03:11] I filed T181574 for the "Jenkins is ignoring my patches" issue I mentioned in here earlier. [23:03:11] T181574: Jenkins is mysteriously not running for some changes - https://phabricator.wikimedia.org/T181574 [23:09:35] James_F: can you file a task for the ruby stuff in for libraryupgrader? And if you can list the exact steps you did to make the commit I can give it a shot [23:10:16] legoktm: Sure. [23:10:29] 10Continuous-Integration-Infrastructure: Jenkins is mysteriously not running for some changes - https://phabricator.wikimedia.org/T181574#3794756 (10thcipriani) According to the zuul debug log there is a dependency cycle that it doesn't like: ``` 2017-11-28 16:19:02,766 DEBUG zuul.source.Gerrit: Updating 10Continuous-Integration-Infrastructure: Jenkins is mysteriously not running for some changes - https://phabricator.wikimedia.org/T181574#3794756 (10Legoktm) There's a circular loop in depends-on. https://gerrit.wikimedia.org/r/#/c/393261/ depends on https://gerrit.wikimedia.org/r/#/c/393279/ which depends back... 
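(Editorial note on the 23:10:29 task comments: the circular Depends-On loop described there is an ordinary cycle in a dependency graph. The sketch below is a minimal illustration of detecting such a cycle with a depth-first search. It is not Zuul's code; the function name `find_cycle` and the change IDs `change-a`, `change-b`, `change-c` are hypothetical placeholders, not the Gerrit changes from the log.)

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of change IDs, or None.

    `depends_on` maps a (hypothetical) change ID to the IDs it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(change: str) -> Optional[List[str]]:
        state[change] = GREY
        path.append(change)
        for dep in depends_on.get(change, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path: that closes a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[change] = BLACK
        return None

    for change in depends_on:
        if state.get(change, WHITE) == WHITE:
            found = visit(change)
            if found:
                return found
    return None


# Hypothetical example: change-a and change-b depend on each other.
cyclic = {"change-a": ["change-b"], "change-b": ["change-a"], "change-c": []}
acyclic = {"change-x": ["change-y"], "change-y": []}
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

(Returning the loop itself, rather than a bare true/false, is one way a tool could report which changes are involved, which is what the 23:12:58 comment asks for.)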
[23:12:58] 10Continuous-Integration-Infrastructure: Jenkins is mysteriously not running for some changes - https://phabricator.wikimedia.org/T181574#3794796 (10Anomie) It'd be nice if it told me that. And why does it affect https://gerrit.wikimedia.org/r/#/c/393260/? [23:16:15] 10Continuous-Integration-Config, 10VPS-project-libraryupgrader: Make a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain - https://phabricator.wikimedia.org/T181576#3794818 (10Jdforrester-WMF) [23:16:19] 10Continuous-Integration-Config, 10VPS-project-libraryupgrader: Make a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain - https://phabricator.wikimedia.org/T181576#3794844 (10Jdforrester-WMF) I'm sure a better Ruby guru (pretty much anyone) could turn my comments into a better set of... [23:16:41] 10Continuous-Integration-Config, 10VPS-project-libraryupgrader: Make a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain - https://phabricator.wikimedia.org/T181576#3794847 (10Jdforrester-WMF) [23:16:55] 10Continuous-Integration-Config, 10VPS-project-libraryupgrader: Make a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain - https://phabricator.wikimedia.org/T181576#3794818 (10Jdforrester-WMF) [23:16:55] Done. [23:18:43] PROBLEM - Puppet errors on deployment-zotero01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:20:21] 10Continuous-Integration-Config, 10VPS-project-libraryupgrader: Make a bot to bump our trivial-but-with-security-warnings Ruby CI dependency chain - https://phabricator.wikimedia.org/T181576#3794859 (10Legoktm) [23:24:13] awight: I'm in a meeting and not clear on the situation (wasn't it the scb servers that were the issue? why is ORES upgrading?) [23:24:49] PROBLEM - Puppet errors on deployment-urldownloader is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:24:57] greg-g, will give you an overview of what we learned. 
[23:25:08] greg-g: Sorry—So we discovered that the extension has a cache stampede when the service is down. [23:25:43] It’s making a difficult overload situation much worse, by c. 400 unnecessary requests/second [23:25:44] MW has some code to ask ORES fore some meta-data. That meta-data request has gone totally out of control. SCB problems were independent of this. This code change will throttle it. [23:26:22] I have an Extension:ORES hotfix for the wmf.8 branch which should cache failures for 1minute to give us some breathing space. [23:26:37] Fun story. We (akosiaris) accidentally ran an experiment by breaking ORES config and we were able to show that the issues stop when we stop the stampede. [23:27:01] So we're not waving around wildly hoping this will help. [23:27:33] doit [23:27:35] * awight also tries to straighten my tie [23:30:20] Thanks greg-g :) [23:30:57] PROBLEM - Puppet errors on deployment-kafka-jumbo-2 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:33:48] PROBLEM - Puppet errors on deployment-eventlog02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:34:02] PROBLEM - Puppet errors on deployment-tmh01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:36:10] PROBLEM - Puppet errors on deployment-netbox is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:37:25] PROBLEM - Puppet errors on deployment-kafka-jumbo-1 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:39:58] 10Release-Engineering-Team, 10User-greg: Create #wikimedia-releng-feed and move bots there - https://phabricator.wikimedia.org/T181582#3794965 (10greg) [23:44:09] * halfak and awight continue discussions with legoktm about deploying this change.
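(Editorial note on the hotfix discussed from 23:25:08: caching failures for a short period, so that a service that is down is not re-queried on every request, can be sketched roughly as below. This is an illustration only, assuming a single in-process cache; it is not the Extension:ORES patch. The class name `FailureCachingFetcher`, its parameters, and the `always_down` helper are hypothetical. The 60-second default mirrors the "1minute" mentioned in the log.)

```python
import time
from typing import Any, Callable, Dict, Optional


class FailureCachingFetcher:
    """Wrap a fetch function and remember failures for `failure_ttl` seconds."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        failure_ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._failure_ttl = failure_ttl
        self._clock = clock
        self._failed_until: Dict[str, float] = {}

    def get(self, key: str) -> Optional[Any]:
        now = self._clock()
        failed_until = self._failed_until.get(key)
        if failed_until is not None and now < failed_until:
            # A recent failure is cached: do not hit the backend again yet.
            return None
        try:
            value = self._fetch(key)
        except Exception:
            # Remember the failure so later callers back off for a while.
            self._failed_until[key] = now + self._failure_ttl
            return None
        self._failed_until.pop(key, None)
        return value


# Usage with a fake clock and a backend that is always down (both hypothetical).
calls = []
fake_now = [0.0]


def always_down(key: str) -> Any:
    calls.append(key)
    raise ConnectionError("service unavailable")


fetcher = FailureCachingFetcher(always_down, clock=lambda: fake_now[0])
for _ in range(3):
    fetcher.get("metadata")
print(len(calls))  # the backend was contacted once, not three times
```

(A shared cache and some jitter on the expiry would matter in a multi-server deployment; those are left out of this sketch.)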