[05:06:47] PROBLEM - SSH on deployment-lucid-salt is CRITICAL: Connection refused [05:06:47] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:07:04] as you can see [05:07:07] shinken’s back up :) [05:08:50] PROBLEM - Puppet failure on deployment-mathoid is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:10:32] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:21] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:23] PROBLEM - Puppet failure on deployment-sca01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:11:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<100.00%) [06:36:38] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [08:49:01] 10Continuous-Integration, 10Gather: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1227033 (10hashar) [08:54:26] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227041 (10phuedx) 5Open>3Resolved All yer patches are merged @Jdlrobson. [08:54:28] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1227043 (10phuedx) [09:00:02] 10Continuous-Integration, 10Gather: Gather should be using its own Gruntfile in Jenkins - https://phabricator.wikimedia.org/T92589#1227050 (10hashar) @Jdlrobson wrote: > PS. @hashar we really need to make these jobs something that developers get for free when they setup an extension. Maybe this is something we... [09:11:28] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227055 (10hashar) Logged in with my LDAP account, I have manually triggered runs for the three... [09:37:34] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227102 (10hashar) I have poked @fgiunchedi about the Trusty packages. Rebuild it out of the integration/zuul.git deb... 
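hashar's last comment above is about rebuilding the Zuul Debian package for Trusty out of integration/zuul.git. The exact packaging layout isn't shown in the log, so the following is only a generic sketch of such a rebuild on a Trusty build host; the clone URL and the `debian` branch name are assumptions.

```
# Rough sketch of rebuilding the Zuul .deb on an Ubuntu Trusty host.
git clone https://gerrit.wikimedia.org/r/integration/zuul   # packaging repo (URL assumed)
cd zuul
git checkout debian                                          # hypothetical packaging branch name
sudo apt-get install -y devscripts equivs build-essential    # build tooling
sudo mk-build-deps -i -r debian/control                      # install declared build-deps
dpkg-buildpackage -us -uc -b                                 # build an unsigned binary package
ls ../zuul_*.deb
```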
[10:53:57] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Blocked-on-Operations, and 3 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1227205 (10fgiunchedi) 5Open>3Resolved {{done}} [10:53:59] 10Continuous-Integration: Zuul: upgrade to latest upstream version - https://phabricator.wikimedia.org/T48354#1227209 (10fgiunchedi) [10:54:01] 10Continuous-Integration: Upgrade Zuul server to latest upstream - https://phabricator.wikimedia.org/T94409#1227208 (10fgiunchedi) [10:54:03] 10Continuous-Integration, 7Zuul: Zuul: python git assert error assert len(fetch_info_lines) == len(fetch_head_info) - https://phabricator.wikimedia.org/T61991#1227207 (10fgiunchedi) [11:06:19] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0] [11:13:50] RECOVERY - Puppet failure on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [11:16:25] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:20:29] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [11:31:15] !log integration: Zuul package has been uploaded for Trusty! Deleting the .deb from /home/hashar/ [11:31:18] Logged the message, Master [11:34:06] !log integration: apt-get upgrade on integration-slave-trusty* instances [11:34:09] Logged the message, Master [11:40:16] PROBLEM - Puppet failure on integration-slave-trusty-1014 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:50:14] RECOVERY - Puppet failure on integration-slave-trusty-1014 is OK: OK: Less than 1.00% above the threshold [0.0] [12:48:04] !log beta: Andrew B. starting to migrate beta cluster instances on new virt servers [12:48:07] Logged the message, Master [12:54:56] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1227342 (10hashar) 5Resolved>3Open Reopening since the builds I triggered earlier have some... [12:54:58] 10Browser-Tests, 10Continuous-Integration, 7Tracking: Fix or delete browsertests* Jenkins jobs that are failing for more than a week (tracking) - https://phabricator.wikimedia.org/T94150#1227344 (10hashar) [12:59:40] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227356 (10hashar) a:3hashar Created instance i-00000b91 m1.small with image "ubuntu-12.04-precise" and integration-saltmaster.eqiad.wmflabs. https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000... [13:12:33] (03PS3) 10Hashar: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:18:24] PROBLEM - Host deployment-bastion is DOWN: PING CRITICAL - Packet loss = 100% [13:18:50] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [13:23:38] (03PS4) 10Hashar: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:25:22] (03CR) 10Hashar: "That needs DOC_SUBPATH to be set which is done only if the job name is suffixed with '-publish'." 
(031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:25:27] (03CR) 10Hashar: [C: 032] Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:27:26] (03Merged) 10jenkins-bot: Convert 'mediawiki-vagrant-puppet-doc' job to run on a labs slave [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:34:26] (03CR) 10Hashar: "So the publish job worked just fine https://integration.wikimedia.org/ci/job/mediawiki-vagrant-puppet-doc-publish/642/" [integration/config] - 10https://gerrit.wikimedia.org/r/204980 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:35:11] 10Continuous-Integration, 5Patch-For-Review: Migrate all jobs to labs slaves - https://phabricator.wikimedia.org/T86659#1227420 (10hashar) [13:38:24] (03CR) 10Hashar: [C: 04-1] "Suffix the job name with '-publish' and that will be good :)" (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/204982 (https://phabricator.wikimedia.org/T86659) (owner: 10Legoktm) [13:38:28] PROBLEM - Puppet failure on integration-saltmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [13:51:32] !log integration-slave-trusty-1015:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/node_modules [13:51:34] Logged the message, Master [13:53:29] RECOVERY - Puppet failure on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [13:58:37] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1227501 (10Aklapper) >>! In T95469#1226330, @mmodell wrote: > @christopher: I'd be ok with having the dates on the default form but that isn't really my call. Same here: Would... [14:04:34] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1227540 (10chasemp) This blocking task has not be resolved for to facilitate upgrade. Pursuant to: >>! In T95469#1217373, @chasemp wrote: >>>! In T95469#1199733, @ksmith wrot... [14:10:06] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227545 (10hashar) I have applied `role::salt::masters::labs::project_master` and ran puppet. I took: * public key from `/etc/salt/pki/master/master.pub` * fingerprint via `salt-key -f /etc/salt/pki/... [14:15:32] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1227570 (10SBisson) An error in ext/popups was preventing most our tests from running. After https://gerrit.wikimedia.org/r/#/c/202976/ is merged, we'll what... 
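T87819 above describes bringing up integration-saltmaster and collecting the master key fingerprint from /etc/salt/pki/master/master.pub. A minimal sketch of verifying that key exchange; the minion id used below is invented, not one of the real instances.

```
# On integration-saltmaster: list key fingerprints (local master keys and minions)
sudo salt-key -F

# Inspect and accept one pending minion key
sudo salt-key -f i-00000abc.eqiad.wmflabs    # hypothetical minion id
sudo salt-key -a i-00000abc.eqiad.wmflabs

# On the slave (minion): print the local key fingerprint to compare
sudo salt-call --local key.finger

# Back on the master: confirm minions respond
sudo salt '*' test.ping
```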
[14:19:16] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [14:19:26] 7Blocked-on-RelEng, 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, and 3 others: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1227614 (10coren) [14:19:29] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.073 second response time [14:19:30] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1791 bytes in 3.037 second response time [14:21:06] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1534 bytes in 6.801 second response time [14:21:34] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1534 bytes in 3.056 second response time [14:21:54] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1792 bytes in 6.183 second response time [14:23:10] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227647 (10hashar) 5Open>3Resolved The salt autosigner is part of puppet class `puppetmaster::autosigner`. I have applied it and that creates the cron: ``` * * * * * /usr/local/sbin/puppetsigner.py... [14:28:30] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [14:29:27] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47129 bytes in 0.639 second response time [14:29:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47221 bytes in 0.656 second response time [14:31:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.560 second response time [14:31:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.534 second response time [14:31:49] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28266 bytes in 0.544 second response time [14:31:55] 10Continuous-Integration: Set up salt for integration slaves in labs - https://phabricator.wikimedia.org/T87819#1227674 (10hashar) Announced on QA list https://lists.wikimedia.org/pipermail/qa/2015-April/002246.html [14:34:41] !log beta: failures on instances are due to them being moved on different openstack compute nodes (virt***) [14:34:44] Logged the message, Master [14:37:11] greg-g: (or someone else): I need one of you to give me a hand to make sure the deployment-prep is ready for idmap being turned off tomorrow. Shouldn't take very long (1-2h tops) [14:39:08] I have no idea what idmap is [14:42:00] greg-g: I've emailed regularily on the topic. :-) The short story: (a) precise instances that have not been rebooted since Apr 23 need to be rebooted, and (b) we need to make sure that anything that is written to /data/project is owned by a user that is either in LDAP or managed by puppet. [14:42:31] greg-g: In deployment-prep's case, (b) is already mostly done (maybe all) when we moved the users to LDAP a while ago. [14:43:33] greg-g: So it's really 99% just rebooting the instances in whichever order is least disruptive and making sure they come back up happy. 
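Coren's point (b) above is essentially an ownership audit of /data/project. A rough sketch of the kind of checks that implies; the path comes from the log, the account names are illustrative.

```
# Files whose owning uid/gid no longer resolves to any known user or group
sudo find /data/project -xdev \( -nouser -o -nogroup \) -ls | head

# Summarise which owner:group pairs hold files (slow on a large share)
sudo find /data/project -xdev -printf '%u(%U):%g(%G)\n' | sort | uniq -c | sort -rn | head

# Confirm a given owner resolves via LDAP/NSS
getent passwd udp2log www-data mwdeploy
```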
[14:43:55] Coren: to where? [14:43:58] (didyou email) [14:44:14] labs-announce, labs-l and, iirc, wikitech-l for most. [14:44:18] thcipriani: ^ [14:45:53] yeah, I can help with that [14:46:18] least disruptive order would be something I'd need to explore a bit [14:48:13] thcipriani: Sure. From my side, the order is really immaterial so it's really for the project's benefit. "pull-out-the-bandaid and reboot 'em all now" vs "ginglerly, in a precise order". :-) [14:49:29] Coren: right. I'll probably want to get advice from hashar on that. [14:50:19] thcipriani: Well, he sent me to #wikimedia-releng specifically to talk with you guys. :-) [14:51:30] * Coren brb shortly, off to get tea. [14:56:08] mmm tea [15:00:16] is for the weak, french press here :) [15:02:23] <^d> We used to have a french press but it's too time consuming most mornings :p [15:03:09] took 2nd place in the Longmont adult science fair for brewing coffee :) https://tylercipriani.com/coffee-extract/ [15:03:55] * ^d is new to coffee [15:04:05] thcipriani: way to just roll in a with a sledgehammer there. "BAM! MATH!" [15:04:07] <^d> I finally got over my childhood aversion to the taste just a few months ago [15:05:03] heh, the first place I went in SF, was Blue Bottle, pretty awesome. [15:05:53] :) [15:09:10] <^d> thcipriani: Have you tried Sightglass? Also very good [15:09:47] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1227905 (10hashar) A status update: The initial Debian packaging is ready for review https://gerrit.wikimedia.... [15:10:04] Coren: so, looking at deployment prep, there aren't too terribly many precise instances, but there may be some ownership issues for the deployment-prep project disk, logs is owned by 996 and cxserver is owned by 995 [15:10:19] ^d: no, but their website does look promising [15:12:25] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227916 (10hashar) 3NEW [15:15:59] 10Continuous-Integration, 5Continuous-Integration-Isolation, 6operations, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1227948 (10hashar) I have poked our internal ops list to get some tips and hints. [15:19:10] thcipriani: I think we can afford some downtime on beta cluster [15:19:20] if we announce it ahead of time, it is probably acceptable [15:20:24] hashar: kk, my initial take was take down everything at once save deployment-salt since it's the puppetmaster and everything will want to hit it when they come back up [15:21:09] thcipriani: That's only an issue if those are accessed cross-instance /and/ the uids differ. [15:21:11] after all other precise instances have rebooted, then reboot deployment-salt—seem reasonable? [15:21:26] ah for NFS idmap [15:21:53] the main offenders were mwdeploy / apache [15:22:14] hashar: Right, and those are known to be uniform now. [15:22:15] since they share files on /data/project . But the users have been created on the NFS server [15:22:27] and I think some work has been done recently to ensure they are created with a stable UID [15:22:33] also I think we renamed apache to www-data [15:22:46] hashar: That's prod standard and a good idea anyways. 
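The idmap concern above boils down to one question: does each shared account resolve to the same uid on every instance that writes to /data/project? A sketch of checking that; the 'deployment-*' glob is illustrative only, since (as noted later in the log) salt minion ids are the internal instance names, not the friendly ones.

```
# Via salt, compare the uid each instance assigns to the shared accounts
sudo salt 'deployment-*' cmd.run 'getent passwd www-data udp2log mwdeploy'

# Or, without salt, loop over a hand-written host list
for h in deployment-mediawiki01 deployment-mediawiki02 deployment-bastion; do
    echo "== $h =="
    ssh "$h" 'id -u www-data; id -u udp2log' 2>/dev/null
done
```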
[15:22:55] yup [15:23:03] apache was probably inherited from the Fedora era [15:23:22] (yeah we used to run Fedora) [15:23:41] then under /data/project there are a few more suspects [15:23:51] Mostly loggin afaict [15:24:12] /data/project/syslog/ files are written by syslog-ng on deployment-bastion but that is root:wikidev owned [15:24:32] /data/project/logs is the udp2log daemon on deployment-bastion [15:24:36] hashar: https://phabricator.wikimedia.org/T95554 has a recent list [15:25:08] ah great [15:25:47] /data/project/parsoid/ is written from the Parsoid instance -cant remember the name- [15:26:00] tl;dr: I see nothing especially worrisome in deployment-prep. Almost all logs. [15:26:03] anyway, I have no idea what NFS idmap is and what it can disrupt :/ [15:26:39] hashar: LOng story short: the only thing it can disrupt is if two instances access the same files currently with the same /username/ but different user ids. [15:26:39] the main trouble would be hosted files / thumbs in /data/project/upload7 which are written by multiple instances and owned by www-data:www-data [15:26:49] www-data is known to be fixed. [15:26:53] So not an issue. [15:27:39] It's uid 33 everywhere - and part of base so invariant. [15:27:42] so it is probably going to be fine :) [15:28:00] udp2log is quite needed [15:28:12] iirc that relays all mediawiki logs to logstash somehow [15:28:23] Right - I don't actually expect trouble. Doesn't mean I don't want to babysit the process. [15:28:27] looks like udp2log is only on flouride... [15:28:29] but maybe that relay doesn't even hit the disk [15:28:56] thcipriani: oh yeah, maybe ori migrated it out of deployment-bastion where I have set it up orginally [15:29:04] sorry that is a bit of a mess :( [15:29:28] we don't use udp2log to get logs into logstash anymore but it is used to get on disk logs in beta cluster and prod [15:29:40] * bd808 reads backscroll [15:29:40] er, sorry, udp2log is bastion-only [15:30:25] Things that are accessed by only one instance are guaranteed to be unaffected - you can't have an uid mismatch with yourself. :-) [15:32:00] the log path for things written to /data/project/logs is [host] > udp > [deployment-bastion] > udp2log > disk [15:32:27] [host] is any node running MediaWiki in the cluster [15:32:57] so from the looks of it udp2log should be fine. Now wondering about deployment-salt/log [15:36:23] Coren: I guess I'm not seeing the issues with deployment-salt/log from https://phabricator.wikimedia.org/T95554#1196735 [15:37:24] thcipriani: There may not be one - those are simply place where there are files owned by users which are simply not _guaranteed_ to have the same uid between instances. [15:37:39] Usually, those are owners managed by debian packages. [15:37:48] (As opposed to base, ldap, or puppet) [15:39:22] gotcha, yeah, instances seem happy and in agreement about uid==owner for those files, at least with a cursory check [15:42:56] hashar: where do we announce downtime for deployment-prep? Also, how do we prevent shinken spam while we're doing this? [15:46:49] thcipriani: Yuvi can hush shinken. I'd announce on labs-l, qa-l and engineering-l [15:47:07] and set the topic here when you are actually doing it [15:47:34] bd808: thanks! [15:57:18] Coren: this needs to be done by tomorrow for the idmap shut-off? Also, above, you said, "precise instances that have not been rebooted since Apr 23" need to be rebooted, which is tomorrow unless you meant last year or is a typo. [15:57:40] It's a typo. @^#%$. 
That was meant to be the 13th [15:58:12] thcipriani: Incidentally, reboot-if-idmap is a noop on boxes that don't need it. Might save you a reboot or two. :-) [16:01:44] thcipriani: I can try to hold off the idmap thing as long as I can if you need the extra time; I've got things that depend on it to progress but nothing will break if I delay. [16:02:23] Coren: kk, doing some salt spelunking now [16:28:45] twentyafterfour: scap deployment today? [16:32:24] marxarelli: yeah in about an hour [16:35:03] twentyafterfour: mind if i sit in? [16:35:19] no I don't mind at all [16:35:41] for a preview: this is the general outline: https://etherpad.wikimedia.org/p/mmodell [16:36:02] oh boy. that looks fun! [16:39:15] oh whoops, that's going to overlap with SoS [16:40:08] anyone want to attend SoS for me? greg-g? thcipriani? it's super fun. like the funnest [16:41:28] Coren: looks like all 19 precise boxes will need to be rebooted, after some deeper dives into files, I don't foresee any problems. I think we can reboot this afternoon, reboot all except deployment-salt then do deployment-salt, just to keep puppet happy. [16:43:26] marxarelli: I'm a chicken, not allowed [16:43:53] if we want to do a reboot this afternoon, I should announce a downtime now— YuviPanda: would you be able to shush shinken while this all happens if we decide to reboot this afternoon? [16:46:13] marxarelli: when and how long is SoS? And how painful :P [16:48:44] thcipriani: the deployment is a two hour window [16:48:59] 11:00 to 1:00 [16:49:06] er sory marktraceur [16:49:12] pinging wrong person [16:49:28] I'm in a meeting and somewhat distracted I'm sorry [16:50:00] twentyafterfour: no problem [16:50:48] thcipriani: not very painful [16:54:19] thcipriani: it's cake. you just fill out the etherpad with our team's updates, highlight any blockers, and answer questions should any other team be blocked by us [16:54:54] only 30 mins? not too bad, we're not blocked on anyone, seems like idmap may be one of the few things blocked on us. marxarelli I can jump in there if you wanna watch a deploy. [16:55:11] thcipriani: nice! thanks homey [16:56:55] marxarelli: np [17:01:06] marxarelli: was there anything that needed mentioning about isolated CI? [17:03:25] thcipriani: Sounds excellent. [17:04:10] thcipriani: you could mention the breaking up of the pool into smaller instances but unless there's a blocker i don't usually go into much detail [17:05:07] Coren: You around 2pm PDT to do that? I'm assuming you can stop the libvirt hosts so I don't have to go all clicky on wikitech, or open 19 shell sessions :) [17:05:26] marxarelli: kk, thanks [17:06:13] thcipriani: I'll be standing by for you and can reboot in batch - just give me a list of instances and I'll roll it off. [17:16:02] Coren: Here's the instance list and reboot groups: https://phabricator.wikimedia.org/P545 [17:16:42] Esckchellent. [17:17:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:10] thcipriani: I presume there is no issue if group 1 are rebooted in batches of 2-3? [17:17:40] * Coren doesn't want to bring the virt hosts to their knees with too many simultaneous reboots. [17:19:04] Coren: I don't _think_ so, but I moved the db servers to the end of the list, just to make me feel good. [17:19:16] fair 'nuf. :-) [17:20:31] Coren: how long will it take, roughly, for the whole reboot? Want to give some padding in the announce. [17:21:30] Hm. 
7 batches or so - count 15m to give some elbow room. Reboots proper tend to take only a minute or so. [17:21:40] So give a 30m window and we're golden. [17:21:55] 1h if you're feeling paranoid. [17:21:59] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.525 second response time [17:23:21] Coren: 1 hr it is :) [17:39:11] is there a good way to get more verbose output from the browsertests? specifically i would like to see what watir is actually telling the browser to do step by step. I've tried the --verbose option to cucumber, but that just reports on cucmber don't get anything from watir [17:43:48] ebernhardson: you can try setting `$DEBUG = true` in a before hook. i can't remember if that will echo the selenium commands or not, but it will at least spit out the xpath/css selectors after they're compiled [17:44:30] so add `Before { $DEBUG = true }` to your env.rb or step definition file [17:45:57] trying that now, thanks [17:46:10] had to google cucumber before hook, found that part :) [17:48:55] marxarelli: ahha, thanks that is giving me more info to work with at least [17:52:11] thcipriani: you are obviously a serious unix guru. That's an epic beard you have today. :) [17:52:59] well, ya know, I try :) [17:53:21] * greg-g 'll have to trim mine soon so I don't feel bad [17:54:01] also, to be fair, I do sport this epic beard every day [17:54:36] and also, to be fair, some days I feel a little outmatched by greg-g's beard. [17:54:49] I wear my beard on the inside -- https://www.youtube.com/watch?v=_yVPewAybZw [17:54:51] +1 on thcipriani’s beard [17:54:59] thcipriani: greg-g’s mane you mean [17:57:03] * greg-g is growing the hair out to match [17:57:18] * YuviPanda wishes he could do that [17:57:23] lately i've noticed that greg-g's head hair is encroaching on his beard's territory and wondered whether they will unite in harmony to become the Unimane or if there will be trouble [17:57:26] alas, genes and what not [17:58:21] marxarelli: there are pictures of me from my camp counselor days on FB that show what it looks like after 6 years of head hair growth and 6 months of beard growth [17:58:30] !log Creating integration-slave-trusty-1021 per T96629 (using ci1.medium type) [17:58:35] Logged the message, Master [17:58:56] let me guess, the beard grows as long in 6 months as the hair does in 6 years? [17:59:08] greg-g: oh man, i have some of those too [17:59:20] marxarelli: deploy time [17:59:21] lot's of Jesus and yeti comments [17:59:21] thcipriani: basically, at least relative to my shoulders. [17:59:28] twentyafterfour: booyah [18:00:15] lets see, does tin have tmux...or you wanna use screen sharing on hangouts? [18:00:15] twentyafterfour: hangout probably [18:00:16] + tmux if necessary [18:00:16] yeah no tmux [18:02:15] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [18:02:47] 5Continuous-Integration-Isolation, 6operations: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1228381 (10RobH) a:5RobH>3chasemp I'm going to assign this to chase, only while the discussion is pending about the networking. (Since he is discussing with @mark).... [18:07:02] <^d> twentyafterfour: fwiw, 1.26wmf2 was tracking master earlier instead of its proper branch. I fixed it and scapped. You shouldn't have any problems with it but just fyi in case it broke again. 
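The batched reboot plan discussed above (2-3 instances at a time, roughly 15 minutes overall) could be scripted along these lines. The host list below is a placeholder; the real groups live in the P545 paste, which isn't reproduced in the log.

```
# Sketch: reboot a few instances at a time, pausing between batches.
hosts=(deployment-cache-text02 deployment-cache-mobile03 deployment-mediawiki01)  # ...placeholder list
batch=3
for ((i = 0; i < ${#hosts[@]}; i += batch)); do
    for h in "${hosts[@]:i:batch}"; do
        echo "rebooting $h"
        ssh "$h" sudo reboot || true
    done
    sleep 120   # give the batch time to come back before starting the next one
done
```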
[18:07:12] <^d> (during swat this morning) [18:07:21] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 130.68 ms [18:09:14] thcipriani: ^d just wanted to say thanks for all the work done on staging :) [18:09:38] <^d> Lots of nice cleanup happened :) [18:11:32] YuviPanda: thank you—lots of mountains moved with your effort. Hopefully, we can rebuild momentum on that project within a couple of quarters. [18:11:41] +1 [18:11:52] thcipriani: ^d what’re we going to do with all the curren hosts? let ‘em lie idle? [18:12:31] <^d> While nobody's going to be working on it officially it might end up being something we work on while bored, so I wouldn't kill it all [18:13:11] fair enough [18:13:16] I don't know how much those virt resources are needed, but it is nice to have around for the sole fact that I'm not worried about breaking it :) [18:13:19] and we’re not crunched for space atm [18:13:32] but I guess we can do something *if* we get crunched [18:15:23] <^d> Yep, if we need extra resources the stuff in staging would be top of the list to kill imho [18:15:44] cool [18:17:26] <^d> We should probably finish pushing any open patches through that we still have [18:17:51] <^d> for me, that's dsh + unifying at least nonexistent.conf in web_sites [18:18:38] yeah, I think at least dsh [18:18:44] i’ve almost 0 overlap with _joe_ now tho [18:18:55] I’ll push it through over the next few days [18:19:11] <^d> I'm rolling back to your PS4 instead of my PS5 on that. [18:19:15] <^d> Your approach was much simpler [18:19:18] yeah [18:19:20] <^d> (and I could always follow up with my ideas) [18:19:27] long term I like yours tho [18:20:58] <^d> ugh, pushing an old patch says no new changes [18:21:00] <^d> stupid gerrit [18:21:28] <^d> I'll rebase then [18:43:26] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:00:43] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228692 (10DannyH) [19:01:49] 10Browser-Tests, 6Collaboration-Team, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1156329 (10DannyH) Elena will check this over. [19:08:03] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228720 (10DannyH) [19:08:25] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [19:13:48] anyone know why the zuul queues are so backed up? [19:14:20] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2182.64 ms [19:16:41] 7Blocked-on-RelEng, 6Release-Engineering, 6Multimedia, 6Scrum-of-Scrums, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1228790 (10Tgr) [19:22:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [19:22:47] <^d> twentyafterfour: Was about to ask the same [19:22:58] <^d> jobs have returned as ok, zuul doesn't know [19:23:19] good evening [19:23:47] <^d> sounds like gearman? [19:23:58] what is going on ? 
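The exchange that follows walks through diagnosing Zuul's embedded Gearman server stalling on gallium. A sketch of the checks it mentions (the zuul log directory, the forked zuul-server process, port 4730, and the `zuul-gearman.py status` ping); the exact error log filename is an assumption.

```
tail -n 50 /var/log/zuul/error.log      # log directory from the conversation; filename assumed
ps axf | grep '[z]uul-server'           # scheduler plus the forked geard child
ss -tnlp | grep 4730                    # is the embedded gearman server listening?
zuul-gearman.py status | head           # the "lame way to ping it"
# If this hangs, the workaround described below is to disable the Jenkins Gearman
# client (freeing the blocked read()), then re-enable it from /ci/configure.
```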
[19:24:06] <^d> queues backed up [19:24:08] <^d> https://integration.wikimedia.org/zuul/ [19:24:10] damn [19:24:24] there are a bunch of graphs at the bottom [19:24:30] <^d> The jobs on top look like the finished in jenkins but zuul doesn't have their return status yet. [19:24:35] <^d> *they [19:24:35] the Zuul Geard job queue has empty / null values [19:24:41] so that would indicate the process is stalled [19:25:01] <^d> <^d> sounds like gearman? [19:25:03] <^d> :D [19:25:14] and at the top [19:25:17] Queue lengths: 57 events, 39 results. [19:25:25] so Zuul has a lot of pending events and is deadlocked somehow [19:25:35] * hashar looks at error log [19:25:40] on gallium in /var/log/zuul/ [19:26:10] NoConnectedServersError: No connected Gearman servers [19:26:12] grbmbmbm [19:26:25] 3705 ? Sl 35:06 /usr/share/python/zuul/bin/python /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf [19:26:25] 3718 ? Sl 69:08 \_ /usr/share/python/zuul/bin/python /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf [19:26:31] the first process is the Zuul scheduler/ server [19:26:39] on start it forks to spawn the gearman server [19:26:46] which listens on port 4730 [19:27:01] lame way to ping it: $ zuul-gearman.py status [19:27:41] might be stalled on some Gearman client connection. [19:27:55] !log Zuul gearman is stalled. Disabling Jenkins gearman client to free up connections [19:27:58] Logged the message, Master [19:28:43] maybe we should a different gearman daemon [19:29:14] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2411.58 ms [19:29:40] ok gearman no more stalled [19:29:56] reenabling the Gearman client [19:30:11] !log Gearman went back. Reenabling Jenkins as a Gearman client [19:30:14] Logged the message, Master [19:33:04] twentyafterfour: ^d: so there is a bug in Zuul gearman which cause it to stall completely :( [19:33:17] it is blocked trying to read on a socket that has no more data [19:33:22] and there is no timeout for the read call :/ [19:33:32] the only way to fix it is to kill the socket [19:33:37] <^d> So you disable the plugin, bounce gearman, then turn the plugin back on? [19:33:40] which is by disabling the Jenkins gearan client [19:33:50] this way that close the scoket, free up the blocking read() call [19:33:54] and resume operation [19:34:00] yup [19:34:05] all from https://integration.wikimedia.org/ci/configure [19:34:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [19:34:38] I havent found a way to reliably reproduce the bug :( [19:37:17] hashar: Hm.. too bad salt uses the internal hostname, not the custom one [19:37:25] '*slave*' matches nothing [19:37:44] Krinkle: yeah I have noticed that. I guess that is because the human friendly hostname has no guarantee to be unique [19:40:18] !log reenabling Jenkins gearman client [19:40:19] damn [19:40:20] Logged the message, Master [19:40:29] I flushed the jobs results per mistake [19:41:42] hashar: We also have two browsertest jobs stuck on IRC freenode [19:42:01] :(( [19:42:08] How did you kill that socket again? [19:42:11] Can you write on https://wikitech.wikimedia.org/wiki/Release_Engineering/Argh ? [19:42:12] and I swear I have downgraded the plugin [19:42:48] Ah, I added section last time. I forogt [19:43:12] so [19:43:16] lets kill Jenkins :( [19:43:18] Hm.. I don't see bash commands on https://phabricator.wikimedia.org/T96183 though [19:43:21] Yeah [19:43:29] I hope the Zuul queue can be preserved? [19:43:35] yeah [19:44:09] Ah, right. 
The $packit didn't fix it last time [19:44:47] yeah packit is just to terminate the CLOSE_WAIT socket [19:44:53] but the underlying java code does not react to it [19:46:21] java :-/ [19:46:29] well java is just fine [19:46:36] but the lame code is causing issues [19:46:38] :D [19:46:53] surely if it was written in python or php it would be easier to fix for us [19:48:06] well yeah, java is just so monilithic, when it fails it requires hard rebooting. it's like windows [19:49:15] twentyafterfour: in my experiences Windows requires re-install of the OS if it fails. [19:49:25] I guess we will want to get rid of the IRC plugin [19:49:30] and roll out our own notification system [19:49:34] Krinkle: yeah then there's that [19:49:52] I stopped using windows right around the time that vista came aout [19:49:57] haven't touched it since then [19:50:09] hashar: sounds like a good idea [19:50:10] Jenkins back up [19:50:18] !log zuul/jenkins are back up (blame Jenkins) [19:50:20] Logged the message, Master [19:50:27] I must've wasted countless hours between the ages of 8 and 14 rebooting Windows 95/98/ME computers only to encounter the same BSOD or DLL error again. [19:51:19] 95 ??? [19:51:49] I have stopped with XP which I nicknamed the Care Bears OS ( http://en.wikipedia.org/wiki/Care_Bears ) [19:52:02] I think I've only seen maybe 20 kernel panics since I switched to only using unix. Most of those were Mac OS X (and I fiddled with making a hackintosh, so kernel panics were to be expected ) [19:52:14] though I have stuck to Win 2K for quite a while (I had a multi proc machine and was playing games) [19:53:29] Ah yeah. OS X 10.2-10.4 I had kernal panics quite often. Not sure what I did wrong. [19:53:33] It never happened to my parents. [19:53:50] But then again, they weren't really "using" the computer the way I did. [19:54:12] 00:03:35.368 ERROR: '[earthquake] Url exceeds maximum length' [19:54:14] lovely message [19:54:16] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3199.34 ms [19:54:24] from https://integration.wikimedia.org/ci/job/mwext-Flow-qunit/5548/consoleFull [19:54:26] hashar: props for Ori to that EventLogging unit test. [19:54:40] It's an expected error being tested [19:55:12] and of course we can't capture/suppress output right? [19:55:35] oh [19:55:44] and Krinkle kudos for the new Zuul status page [19:55:53] the subway like pipelines are quite nice to see in the gate-and-submit [19:56:15] hashar: We can capture it actually. I implemented suppressWarnings()/restoreWarnings() in QUnit last year. [19:56:17] I'll add it [19:56:20] to EL test [19:56:24] \o/ [19:56:25] 10Continuous-Integration, 10Wikimedia-Hackathon-2015: All new extensions should be setup automatically with Zuul - https://phabricator.wikimedia.org/T92909#1228923 (10Jdlrobson) See : https://gerrit.wikimedia.org/r/#/c/205726/ this is wasting reviewers time unnecessarily. [19:57:08] 10Continuous-Integration, 10Gather: PHPUnit tests do not get run by Jenkins for Gather commits - https://phabricator.wikimedia.org/T96904#1228924 (10Jdlrobson) 3NEW [19:57:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 16%, RTA = 388.06 ms [19:59:06] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:00:39] hashar: since you're around for the moment. 
Do you see any problem in doing deployment-prep reboots divided up like this: https://phabricator.wikimedia.org/P545 [20:00:47] just wanted a quick sanity check [20:01:04] thcipriani: that is for the NFS idmap right? [20:01:12] yes [20:01:27] from a quick conversation I had with coren earlier today [20:01:43] we can probaqbly reboot everything however we want [20:01:56] any reason to have salt rebooted separately? [20:02:09] just becuase when instances come back up they'll want to do a puppet run [20:02:17] oh true [20:03:12] I couldn't think of any other reasons to group any reboots. [20:03:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2126.43 ms [20:03:33] yeah I think you can jusdt mass reboot everything [20:04:02] the deployment-cache* machines should not be hitting the NFS shares [20:04:05] might be good candidates [20:04:31] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1228987 (10DannyH) [20:05:46] hashar: cool, thanks for the reassurance. I scheduled the reboot for 2pm PDT. [20:06:32] I think the only potential screw up would be instances not able to write to /data/project due to some uid mismatch [20:08:29] I think the biggest concern is uid mismatches between machines. I spent 15 mins or so this morning reviewing owner/uid stuff—everything seemed ok [20:08:39] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<50.00%) [20:08:42] ^ "everything seemed ok" famous last words [20:13:09] 10Browser-Tests, 10MediaWiki-extensions-UploadWizard, 6Multimedia: Fix failed UploadWizard browsertests Jenkins job - https://phabricator.wikimedia.org/T94161#1229081 (10MarkTraceur) [20:14:25] PROBLEM - Puppet failure on integration-slave-trusty-1021 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:16:25] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuos-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229102 (10Krinkle) 3NEW [20:16:46] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229109 (10Krinkle) [20:19:41] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229116 (10Krinkle) p:5Triage>3High [20:20:00] !log gzipped /var/log/pacct.0 on deployment-bastion [20:20:02] Logged the message, Master [20:25:32] hallo [20:25:48] does anybody know why is this failing? 
- https://gerrit.wikimedia.org/r/#/c/205260/ [20:26:03] I see nothing useful in the mwext-testextension-zend output [20:29:23] RECOVERY - Puppet failure on integration-slave-trusty-1021 is OK: OK: Less than 1.00% above the threshold [0.0] [20:30:47] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: mwext-testextension-zend fails when changing namespace aliases in Scribunto - https://phabricator.wikimedia.org/T96912#1229176 (10Amire80) 3NEW [20:33:00] 10Beta-Cluster, 10VisualEditor: Cannot open any page with VE in Betalabs, getting error "Error loading data from server: internal_api_error_DBConnectionError: [8c78efd3] Exception Caught: DB connection error: Can't connect to MySQL: - https://phabricator.wikimedia.org/T96905#1229198 (10Ryasmeen) p:5Triage>3... [20:33:39] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [20:37:38] thcipriani: Ima go eat a quick bite then I'm standing by. [20:37:57] Coren: kk, whenever you're back I'm ready [20:58:26] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229330 (10hashar) Since new tasks land in our `Untriaged` column should we get a `Config` column to hold them? The advantage would be to still have... [21:01:46] Krinkle: definitely in favor of separating CI tasks. That is a great idea. I replied on https://phabricator.wikimedia.org/T96908#1229330 [21:02:08] Krinkle: merely suggesting to create an additional column but that is probably not that much of a good idea [21:02:34] thcipriani: I'm ready when you are. [21:03:26] 10Continuous-Integration: Run QUnit tests via SauceLabs - https://phabricator.wikimedia.org/T96919#1229336 (10Krinkle) 3NEW [21:04:09] Coren: Yup, I'm ready [21:04:12] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [21:04:18] hashar: The problem is triaging and backlog. Without a separate project we won't know until after we triage. [21:04:20] That's extra work [21:06:02] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229351 (10Krinkle) >>! In T96908#1229330, @hashar wrote: > Since new tasks land in our `Untriaged` column should we create a `Config` column to hold... [21:06:06] thcipriani: In progress. [21:06:16] Coren: watching [21:06:35] thcipriani: btw: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ :( [21:07:04] \o [21:07:07] jdlrobson: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [21:07:29] short answer: yes, beta *cluster* (see also: https://wikitech.wikimedia.org/wiki/Labs_labs_labs ;) ) isn't up to date right now [21:07:58] doh. any estimates on when it's likely to be fixed? Have a product owner asking to test some stuff :) [21:07:59] 10Continuous-Integration, 6Release-Engineering: Run qunit tests in IE8 (and possibly other Grade A browsers) - https://phabricator.wikimedia.org/T96432#1229355 (10Krinkle) [21:08:15] no eta at the moment [21:08:17] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1229358 (10hashar) Perfect we are on the same line. I just wanted to make sure you had the same idea :-) [21:08:21] Krinkle: excellent thanks a ton. [21:08:35] Krinkle: then we can rename Continuous-Integration to Continuous-Integration-Infra ? 
[21:08:41] Yes [21:09:16] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2174.49 ms [21:09:28] twentyafterfour: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/lastBuild/console :( [21:10:02] greg-g: bug i can subscribe to? [21:10:05] Krinkle: nice. I am not there tomorrow but will back on friday. Will write the meeting minutes [21:11:20] jdlrobson: frmo thcipriani mail to engineering list a couple hours [21:11:30] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229359 (10greg) 3NEW a:3mmodell [21:11:38] jdlrobson: done ^ (already cc'd you) [21:12:12] hashar: also that [21:12:29] jdlrobson: unrelated, there will be a beta cluster outage starting -12 minutes ago ish [21:12:34] PROBLEM - Puppet failure on deployment-bastion is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:13:12] thcipriani: That one isn't expected, I think. Want to look into it? ^^ [21:13:17] looking now [21:13:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [21:13:32] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1536 bytes in 0.287 second response time [21:13:57] (Might just be a check for puppet freshness at the wrong time) [21:14:51] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1796 bytes in 2.359 second response time [21:16:15] Coren: re-ran puppet on deployment-bastion, seems fine now [21:16:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [21:16:42] Yeah, okay, that looks like the check is sensitive to reboots. [21:17:53] looking at deployment-restbase01 [21:18:01] deployment-eventlogging02.eqiad.wmflabs is problematic; it appears to be out of date w/ puppet master [21:18:30] Which means that the patch that is meant to be applied with a reboot probably isn't there. [21:18:33] * Coren checks. [21:18:48] The last Puppet run was at Wed Apr 1 01:44:58 UTC 2015 (31413 minutes ago). [21:19:13] * thcipriani looks [21:19:16] Indeed. Should I patch it manually or was puppet supposed to be running there but wasn't? [21:19:37] hashar: ^ ? [21:19:48] Coren: I'm pretty sure puppet was supposed to be running [21:19:52] * thcipriani looks at SAL [21:20:43] Coren: thcipriani no clue [21:21:02] for eventlogging your best chance is to ask ori probably [21:21:09] huh, looks like it's been stalled before, but not record of it being paused [21:21:25] Feb 03 09:15 hashar: Running puppet on deployment-eventlogging02 has been stalled for 3d15h. No log :-( [21:21:45] damn [21:22:03] ah [21:22:14] analytics folks show up in `last` [21:22:26] so maybe fill a ticket about it for Analytics [21:22:30] and hold reboot ? [21:22:31] RECOVERY - Puppet failure on deployment-bastion is OK: OK: Less than 1.00% above the threshold [0.0] [21:22:45] hashar: I can apply the (trivial) fix and reboot manually. [21:23:31] yeah that would work [21:23:42] I guess when puppet is enabled again that would be a noop for that fix [21:23:49] filling the task meanwhile [21:25:26] hashar: It will; it's just a file addition. [21:25:35] sounds sane so :) [21:27:18] thcipriani: All rebooted except for deployment-salt. [21:28:06] all the puppet fails from above seem to have self-corrected. 
[21:28:16] go ahead and kick salt [21:31:44] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229393 (10hashar) 3NEW [21:31:44] thcipriani: I'm all done, and the patch applied neatly. [21:32:46] thcipriani: Everything looking okay on your end too? [21:33:06] Coren: yup, everything looks ok for now, sorry, digging still [21:33:31] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1229402 (10hashar) I have set a message pointing to this task by using: puppet agent --enable; puppet agent --disable 'https://phabric... [21:33:40] Unable to read /srv/mediawiki-staging/php-1.26wmf3/extensions/CiteThisPage/CiteThisPage.php [21:33:46] thcipriani: bits.beta 500s though [21:34:08] PROBLEM - Puppet failure on deployment-mediawiki02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [21:34:35] PROBLEM - Puppet failure on deployment-memc04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [21:34:40] as does mediawiki01 [21:34:41] twentyafterfour: That's not mine - the only paths that can possibly affected are on NFS: /home and /data/project (well, also /data/scratch in theory) [21:35:55] twentyafterfour: sometime files get lost on the staging area :/ [21:36:19] I don't think it got lost [21:36:22] twentyafterfour: though that should use /php-master/ [21:36:56] or was your CiteThisPage issue on prod? [21:37:29] mariadb didn't restart seemingly [21:38:32] mediawiki02 apache2 : (116)Stale file handle: AH00646: Error writing to /data/project/logs/apache-access.log [21:38:35] bah [21:38:48] that comes from https://logstash-beta.wmflabs.org/ [21:39:18] hashar: What's the instance behind that? [21:39:25] deployment-mediawiki02 [21:39:46] Hah! That one didn't reboot? [21:39:53] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28267 bytes in 5.753 second response time [21:40:04] Wait, it wasn't in my list. [21:40:13] /data/project/logs/ is owned by udp2log:udp2log [21:40:20] so that error probaqbly existed before [21:40:39] !log restarted mariadb on deployment-db{1,2} [21:40:42] Logged the message, Master [21:40:46] er, started, I gues [21:40:56] CiteThisPage.php exists and is readable ...weird [21:41:12] hashar: It wasn't in my list because it wasn't precise. :-) [21:41:14] yeah mediawiki02 is 14.04 [21:41:23] ah [21:41:27] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:41:41] * thcipriani looks at parsoid05 [21:41:46] be back in 5, bio [21:43:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46938 bytes in 0.460 second response time [21:44:11] ^ parsoid05 looks ok, too [21:45:19] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: mwext-testextension-zend fails when changing namespace aliases in Scribunto - https://phabricator.wikimedia.org/T96912#1229456 (10Anomie) The problem seems to be that occasionally a timeout in LuaStandalone is reported as a read failure rathe... 
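hashar's `puppet agent --enable; puppet agent --disable '<task URL>'` trick above leaves a breadcrumb explaining why the agent is off. Assuming the Puppet 3.x agents in use at the time, checking that breadcrumb and the last-run time might look roughly like this.

```
# Disable the agent with a reason pointing at the task
sudo puppet agent --disable 'https://phabricator.wikimedia.org/T96921'

# The reason is stored in the agent disable lockfile (JSON on puppet 3.x)
lock=$(sudo puppet config print agent_disabled_lockfile)
sudo cat "$lock"

# When did the agent last actually run?
sudo stat -c '%y %n' "$(sudo puppet config print lastrunfile)"

sudo puppet agent --enable && sudo puppet agent --test
```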
[21:45:48] 10Continuous-Integration, 10MediaWiki-extensions-Scribunto, 7I18n: LuaStandalone timeout is sometimes reported as read error - https://phabricator.wikimedia.org/T96912#1229457 (10Anomie) [21:49:10] memc04 looks fine, even though shinken is upset about its puppet run [21:55:52] deployment-cache-text02 has some problem with the ssl key... [21:56:31] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:59:00] which, maybe, has been happening on all the cache servers for a week...? [21:59:07] RECOVERY - Puppet failure on deployment-mediawiki02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:59:33] RECOVERY - Puppet failure on deployment-memc04 is OK: OK: Less than 1.00% above the threshold [0.0] [22:01:29] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:04:11] Stupid dog seems to have an instinct "If I'm sick, I have to go to the most expensive fabric thing around first." [22:05:02] thcipriani: Are you ready to deliver a verdict? [22:05:34] yeah, everything is back to normal now. Normal being the same problems we had before the reboot :\ [22:05:47] I'll send out the email [22:06:07] re: dog instincts http://www.sheldrake.org/books-by-rupert-sheldrake/dogs-that-know-when-their-owners-are-coming-home [22:07:52] 10Beta-Cluster: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229530 (10Jdlrobson) [22:07:55] hashar: Coren: thanks for your assistance :) [22:13:38] thcipriani: congratulations! [22:14:12] hashar.sleep() [22:14:26] have a good night hashar [22:14:31] thanks again [22:18:57] Have fun guys. Surface any oddities with NFS to me, but I shouldn't expect any. [22:18:58] o/ [22:19:46] legoktm: Ah, I guess you were missing the doc_subpath because the job name didn;t end in _publish [22:19:49] https://github.com/wikimedia/integration-config/commit/ac8bdf3a78995e434d089d1a48ceea0638328c0f [22:26:40] 6Release-Engineering, 3Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1229604 (10chasemp) @KLans_WMF and @Awjrichards could you guys weigh in on where you want to go from here? I'm not sure if https://phabricator.wikimedia.org/T95469#1223742 ex... [22:29:32] twentyafterfour: whatever was happening with CiteThisPage during your deploy today may also be the cause of https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ saddness [22:30:07] Unable to read /mnt/srv/mediawiki-staging/php-master/extensions/CiteThisPage/CiteThisPage.php on mergemessagefilelist [22:30:27] thcipriani: yes [22:30:27] twentyafterfour: fyi hoping we can push this along https://phabricator.wikimedia.org/T95469#1229604 [22:30:31] that is exactly the cause [22:31:35] fix is https://gerrit.wikimedia.org/r/#/c/205988/ [22:31:53] awaiting +2 though I am about to submit a fix that avoids the issue for all extensions [22:32:50] cool beans. 
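The scap failure above comes down to a single unreadable file under mediawiki-staging. A sketch of confirming that from the shell; the log doesn't say which user the job runs as, so mwdeploy below is a guess, not the confirmed account.

```
f=/srv/mediawiki-staging/php-master/extensions/CiteThisPage/CiteThisPage.php
ls -l "$f"
namei -lo "$f"                       # show owner/perms on every path component
sudo -u mwdeploy -- test -r "$f" && echo readable || echo unreadable
```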
[22:38:00] https://gerrit.wikimedia.org/r/#/c/205999/ <-- thcipriani [22:49:59] 10Deployment-Systems, 6Community-Liaison, 6Multimedia: New Feature Notification - https://phabricator.wikimedia.org/T77347#827765 (10Quiddity) [22:50:54] twentyafterfour: hah I wrote the same patch [22:55:56] twentyafterfour: I can +2 that CiteThisPage patch if you need it [22:56:37] well I guess we can abandon it if the other change fixes the same problem [22:57:03] I already self-reviewed the patch for the release branch so I could get on with deploying [22:57:21] I just cherry picked it to master so I wouldn't have the same problem next time [22:59:03] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce build #29: FAILURE in 2.4 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce/29/ [23:21:05] Yippee, build fixed! [23:21:05] Project beta-update-databases-eqiad build #9105: FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9105/ [23:22:51] imma gonna mess with logstash in beta cluster a bit. testing changes for prod [23:24:36] twentyafterfour: Do you not join #wikimedia-dev as a general rule or are you just not there today? [23:25:37] bd808: I don't know, there are a lot of channels ;) [23:25:46] true dat [23:26:26] I thought -dev was just a lot of botspam [23:26:54] there is a lot of botspam but good discussion too. generally of "review this plz" nature [23:27:21] It's the wikitech-l of irc IMO [23:28:34] honestly the level of bot spam we have on our channels makes them almost unbearable for me. i wish we had bot channels and chat channels separated completely [23:28:58] !log deployment-salt:/var/lib/git/operations/puppet in detached HEAD state; looks to be for cherry pick of I46e422825af2cf6f972b64e6d50040220ab08995 ? [23:29:01] Logged the message, Master [23:29:37] I have my client tweaked out to make bots smaller and lighter font. Makes it easier to see the real people [23:30:11] https://github.com/bd808/Textual-Theme-bd808/blob/master/src/scripts/mute-senders.coffee [23:33:14] !log reset deployment-salt:/var/lib/git/operations/puppet HEAD to production; forced update with upstream; re-cherry-picked I46e422825af2cf6f972b64e6d50040220ab08995 [23:33:17] Logged the message, Master [23:33:28] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.048 second response time [23:35:50] bd808: that's pretty nice, I was actually working on something like that for glowingbear .. [23:36:05] I'm on linux so textual isn't an option [23:37:33] there are a lot of things not to love about textual, but the ui being safari and easy to tweak with js is pretty nice [23:39:12] gwicke: were all the "Set up /api/v1/ entry point for restbase" puppet things you? [23:40:19] bd808: bblack & I, yes [23:40:34] do you mean the notifications? [23:40:48] the puppet.git activity [23:41:05] It was in a detached head state. I fixed that [23:41:17] the last cherry-pick is back on there though [23:41:41] ah, on deployment-salt [23:41:56] we are just getting ready to deploy that to prod [23:42:13] so can drop it in labs [23:42:40] once you are done with yours [23:43:35] it will clean up automagically probably. I'll make sure it doesn't get stuck [23:43:58] k, thx! [23:44:09] could also do a rebase -i otherwise [23:47:22] how I make merge job that didn't make it through to re-initiate the merge? 
[23:47:38] specifically https://gerrit.wikimedia.org/r/#/c/203837/ [23:48:18] SMalyshev: re-review with a 0 then +2 again [23:48:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47221 bytes in 0.574 second response time [23:49:07] bd808: aha, thanks, that seems to wake it up [23:56:03] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce build #29: FAILURE in 2.5 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce/29/ [23:57:32] !log cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205968 (remove redis from logstash) [23:57:35] Logged the message, Master
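bd808's advice above, re-reviewing with a 0 and then a +2 to wake the gate up again, can also be done over the Gerrit SSH CLI instead of the web UI. Change 203837 is the one from the log; the patchset number and username are placeholders.

```
change=203837
ps=1          # placeholder patchset number
ssh -p 29418 USER@gerrit.wikimedia.org gerrit review --code-review=0  "$change,$ps"
ssh -p 29418 USER@gerrit.wikimedia.org gerrit review --code-review=+2 "$change,$ps"
```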