[00:08:08] * Coren says evil things about idmap.
[00:08:15] heh :)
[00:08:29] Well, not idmap per se - it works fine.
[00:09:02] But I've been trying to find some way to do things gradually and that just won't work. Turning off idmap on the server will just break every precise instance.
[00:09:30] Switching idmap off on precise instances without turning it off on the server: also breaks.
[00:12:05] Coren: will it ‘break every precise instance’ or ‘break every precise instance that relies on idmap behavior’?
[00:12:25] I think those two sets coincide.
[00:12:34] Well, instances that don't actually use NFS at all won't care.
[00:13:53] hmm
[00:14:28] (this is also why we should get rid of precise :D)
[00:16:19] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1192672 (yuvipanda)
[00:16:31] Labs, Tool-Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Make sure tools-db is replicated somewhere - https://phabricator.wikimedia.org/T88718#1192673 (yuvipanda)
[00:18:02] If we're careful with doing the uid sync up beforehand, it shouldn't be *too* difficult to turn idmap off but I think we're stuck having to do it all at once.
[00:18:17] Coren: anyway, do update the phab ticket - maybe paravoid / other opsen might have ideas. He did have a ‘propose the following’ in the bug description.
[00:18:34] Yeah, will do.
[00:29:38] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1192693 (yuvipanda) Open>Resolved a:yuvipanda ```$ diff -u0 <(qstat -j lighttpd-\*,tomcat-\*,uwsgi-\*,nodejs-\* -u \* -xml | sed...
[00:29:39] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1192696 (yuvipanda)
[00:30:15] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1192703 (yuvipanda)
[00:30:17] Labs, Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1192698 (yuvipanda) Open>declined a:yuvipanda All running webservices have a service manifest now (T95095) and that should be enough.
[00:35:09] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1192714 (yuvipanda)
[00:42:29] Coren: is there a way to look at ‘last X jobs scheduled on a particular host’?
[00:42:47] I’ve a feeling that everything on the old tomcat host is now basically just things using qsub without restrictions
[00:44:29] YuviPanda: Indirectly. The log in /var/spool/gridengine/execd/*/messages will have the info
[00:44:40] Coren: ah, on the client?
[00:44:48] On the node itself.
[00:44:52] right
[00:44:54] looking
[00:47:54] Coren: hmm, I don’t really see anything about accepted jobs, only about hard kills
[00:48:24] Hm.
[00:48:50] We may not have enough verbosity for job starts.
[00:50:36] I’m also considering logging jsub and qsub invocations in some form
[00:50:42] just the command line invocation used
[00:50:46] with… maybe EventLogging
[00:50:59] (writing to a public file on NFS doesn’t seem ideal)
[00:51:08] Eeew. Send 'em to syslog if nothing else - easier to filter and properly rotated.
[00:51:18] hmmm
[00:51:22] EventLogging is centralized.
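A minimal sketch of the syslog idea Coren floats above, assuming a thin wrapper in front of the real jsub; the wrapper path, syslog tag, and facility here are illustrative, not anything Tool Labs actually shipped:

```
#!/bin/bash
# Hypothetical audit wrapper: record the exact invocation in syslog,
# then hand off to the real jsub. syslog gives filtering and rotation
# for free, unlike a world-readable file on NFS.
logger -t jsub-audit -p local0.info -- "$(id -un): jsub $*"
exec /usr/local/bin/jsub.real "$@"   # the renamed original (hypothetical path)
```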
[00:51:27] * YuviPanda considers
[00:52:05] well, Centralized, Easy to run stats on, and what not. Only problem with syslog is I don’t want to set up aggregation myself...
[00:52:25] I might be overestimating that difficulty of course...
[00:52:47] EventLogging also gives me JSON
[00:52:49] which is quite nice
[00:52:50] err
[00:52:55] JSON input, mysql output
[00:53:22] Sure but then you gotta clean up regularly otherwise that table will grow without bound. :-)
[00:53:41] oh I don’t think that’s going to be a problem
[00:53:49] they logged all clicks on infoboxes on mobile for a while without trouble :D
[00:53:54] I doubt jsub is going to cause problems
[00:54:21] I suppose. :-)
[00:54:22] i haven’t really fully thought about it yet. Need to figure out what is the problem I’m trying to solve
[00:54:30] and then figure what to do about it, rather than start with the solution
[00:54:44] anyway, that’s for laters. Now to write puppet code for the manifests
[00:55:19] Coren: btw, tools-manifests is running on bastion atm (accidentally, but I’m letting it). And all tools with bigbrotherrc files or with running webservices have a service manifest now :)
[00:55:47] \o/ so bigbrother can be disabled for a large variety of things.
[00:55:54] hmm, manifester doesn’t email tools yet
[00:55:59] toollog needs a little more work
[00:59:18] Coren: I’m going to call the new machines tools-services-01, is that ok?
[00:59:25] (I’ll have a -02 for hot swap if needed)
[01:00:09] * Coren doesn't particularly enjoy arguing over the paint color. :-)
[01:00:27] Anything works for me so long as it's not ridiculous. Or accidentally offensive. :-)
[01:01:39] :D
[01:01:40] hehe
[01:01:59] Coren: cool. I am also wondering if we should disable the webservice parts of bigbrother
[01:02:02] but maybe not.
[01:02:21] The redundancy can't hurt atm - I'd rather keep both for a bit.
[01:02:53] yeah
[01:02:54] totally
[01:03:28] Coren: I’ll add ‘worker’ support next week and then we can get rid of bigbrother altogether :D
[01:03:31] * Coren watches you carefully before you name some log aggregation service "Anschluß". :-P
[01:03:33] (after having them run together for a week)
[01:03:39] what’s Anschluß?
[01:03:48] ouch
[01:03:49] I see
[01:04:36] Coren: about 5 years ago, I wrote a JS Gadget to make doing article assessments for articles in wikiproject India easier...
[01:04:38] Just joshing you of course. :-)
[01:04:44] Coren: And insisted on calling it ‘AssBar’ of course...
[01:04:45] >_>
[01:04:53] I got talked out of it by Sumana.
[01:04:54] * Coren giggles.
[01:04:57] ESL for the win!
[01:05:01] And called it AssessmentBar
[01:05:50] which reminds me, I should probably add a logrotate to the package
[01:05:55] I wonder if upstart auto rotates
[01:06:18] Coren: also, in tools.wmflabs.org?status - you’ll notice that the new nodes (tools-exec-2*) have close to 0 VMEM free
[01:06:29] that’s because while the others were overprovisioned on VMEM compared to actual RAM, these weren't
[01:06:33] should we?
[01:07:12] Only if you are confident that most/all of the jobs on it share executables.
[01:07:29] Hm, also, vmem is set to some large fraction of ram+swap not just ram.
[01:07:39] Coren: it’s just inconsistent from the other exec nodes.
[01:07:57] Overprovision = more than ram+swap
[01:08:08] totally, and most of the other exec nodes are overprovisioned
[01:08:21] They really shouldn't be. Which ones?
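One way to answer ‘which ones?’ from the grid itself (specific nodes come up in the conversation below) - a sketch using stock gridengine tools; the awk field positions match the usual qhost layout but should be treated as an assumption:

```
# Show each exec node's physical RAM and swap next to its schedulable
# h_vmem complex; nodes where h_vmem exceeds ram+swap are overprovisioned.
qhost -F h_vmem | awk '
  $1 != "HOSTNAME" && $1 != "global" && NF >= 8 { host=$1; ram=$5; swap=$7 }
  /h_vmem=/ { print host, "ram=" ram, "swap=" swap, $NF }'
```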
[01:08:58] Coren: look at tools-exec-03 for example
[01:09:00] 31G VMEM
[01:09:31] lots of the exec nodes seem to have free vmem 0M and memory some absurdly low number
[01:09:39] does GridEngine still schedule things on them?
[01:09:58] Coren: or look at tools-exec-12 :D
[01:18:53] -03, for instance, has 24G swap and 8G ram so 31G vmem is not over
[01:19:04] ...
[01:19:06] 24G swap?
[01:19:22] I thought they all had like, 0.5G swap or something
[01:19:29] so the new instances have 0.5G swap? :)
[01:20:02] Not the ones I created with the old scheme that had proper lvm. Dunno about the trusties though.
[01:21:06] Coren: yeah, they all default to 0.5G Swap
[01:21:19] that’s how they come by default
[01:21:24] so we should add swap, I guess?
[01:21:28] why swap anyway?
[01:22:37] Specifically to avoid overcommitting. In practice, they aren't going to be used (we hope) but it's the only 100% sure way to protect against the OOM killer waking up.
[01:22:53] hmmm
[01:22:55] so
[01:22:59] if a node has 0 VMEM
[01:23:03] will OGE schedule on to it?
[01:23:06] even if usage itself is low?
[01:23:27] No, because it can't possibly guarantee that the h_vmem can be satisfied. Hence the limit.
[01:23:35] right
[01:23:39] so I think we’re underusing those machines...
[01:23:46] * Coren nods.
[01:24:04] The right thing to do is add swap and increase the available vmem by that amount.
[01:24:05] I guess ‘just bump up VMEM’ is the wrong solution, and what we need to do instead is to add swap
[01:24:07] yeah
[01:24:10] let me file a task for that
[01:25:13] At any rate, it's past 9pm here and I want to spend the few hours I can with my hubby during tax season. :-)
[01:25:21] Talk to you tomorrow. o/
[01:26:18] Coren: sweet! :) thanks!
[01:26:31] Coren: ugh, sorry, I wasn’t fully aware of that yet. I’ll adjust my ‘coren time’ clock appropriately
[01:26:32] thanks!
[01:36:30] Labs, Tool-Labs, Patch-For-Review, ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1192849 (yuvipanda) It looks like only tools running on -tomcat node now are ones that were started with qsub and hence have no filters res...
[02:24:01] PROBLEM - Puppet failure on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:07:04] andrewbogott: don’t we need a way to clean puppet and salt certs when instances are deleted if we’re going to use fqdn?
[03:07:33] YuviPanda: probably. What cleans them now?
[03:07:41] andrewbogott: nothing, I think.
[03:07:50] which is ok because ec2ids don’t get used
[03:07:55] but fqdns do all the time
[03:08:03] reused, you mean?
[03:08:05] this was also why I abandoned my earlier ec2id -> fqdn attempt
[03:08:07] gah, yes
[03:08:11] Hm...
[03:08:17] what’s the risk, if one is reused?
[03:08:24] puppet will fail until manually cleaned
[03:08:30] Ah, right.
[03:08:31] since it already has an accepted key for that name
[03:08:33] same with salt
[03:08:43] yeah, so we do need to clean them, you’re right.
[03:09:23] I’m not sure I can think of how to do that, but I guess I’ll make a bug
[03:09:41] andrewbogott: yeah. apergos said they might already have some code for a nova plugin that does this, which is what we need
[03:09:56] a nova plugin that executes a clean on the master when an instance delete event happens
[03:10:01] yeah, I can write it easily enough if it doesn’t exist already.
[03:10:05] yeah
[03:10:11] \o/ cool
[03:10:16] death to ec2ids! :)
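What the clean-on-delete hook boils down to on the master side, as a sketch; puppet cert clean and salt-key are the stock CLIs the current scripts shell out to, and wiring this to nova's delete notifications is the part that doesn't exist yet:

```
#!/bin/bash
# Hypothetical cleanup hook: given the FQDN of a deleted instance,
# revoke its puppet certificate and drop its accepted salt key so
# the name can be reused without puppet failing.
fqdn="$1"
puppet cert clean "$fqdn"   # revokes and removes the signed cert
salt-key -y -d "$fqdn"      # -y skips the confirmation prompt
```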
[03:10:32] although getting the plugin to run on virt1000 will be annoying
[03:10:40] yeah...
[03:10:55] anything with virt1000 feels annoying :)
[03:10:55] I hope we get a new machine, upgrade to trusty, etc
[03:10:58] I don’t suppose puppet or salt have a rest interface for cert management
[03:11:06] afaik no...
[03:11:13] the current scripts just shell out
[03:11:25] Oh, that’s not the issue, just — right now those events are consumed by designate. So it’s easy to hook them on holmium.
[03:11:38] Less so on virt1000, will have to produce another event for consumption elsewhere
[03:11:43] ah, I see.
[03:11:55] isn’t virt1000 the ‘controller’? shouldn’t the events be produced there?
[03:12:14] They’re produced on the compute nodes.
[03:12:28] virt1000 isn’t really involved. The api is on labnet1001...
[03:12:36] virt1000 has the scheduler, but the scheduler doesn’t care about deletion
[03:12:51] oh, I see.
[03:13:02] And in any case we probably shouldn’t assume that the puppet and salt masters are… anywhere in particular.
[03:13:04] * YuviPanda is only very, very vaguely aware of what part handles what in our OpenStack infra
[03:13:06] right
[03:13:59] RECOVERY - Puppet failure on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:14:47] Labs: Automatically clean salt and puppet certs on instance deletion - https://phabricator.wikimedia.org/T95519#1192985 (Andrew) NEW a:Andrew
[03:21:39] YuviPanda: check it out: http://docs.puppetlabs.com/guides/rest_api.html
[03:21:57] ooooo
[03:21:59] andrewbogott: that’s awesome
[03:22:04] I didn’t know about that at all
[03:22:07] me neither
[03:24:05] I can’t tell if there’s an equivalent for salt...
[04:05:16] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1193030 (yuvipanda) Wheee, it is puppetized and running on tools-services-01 now \o/
[04:06:10] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1193031 (yuvipanda) NEW
[05:30:45] Tool-Labs, Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1193065 (scfc) Resolved>Open It still needs to be puppetized. AFAICS, it's set in `qconf -ssconf`, and currently there is no function in the `gridengine` module for that....
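For reference, the knob T95485 is about; inspecting it is easy, and a non-interactive change could look roughly like this - the sed edit is just one way to avoid the interactive $EDITOR round-trip, and is the sort of thing the `gridengine` module would presumably template instead:

```
# Show the current scheduler configuration, including schedule_interval.
qconf -ssconf | grep schedule_interval

# Replace it non-interactively: dump, rewrite, reload.
qconf -ssconf | sed 's/^schedule_interval .*/schedule_interval 0:0:10/' > /tmp/sconf
qconf -Msconf /tmp/sconf
```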
[05:30:56] Tool-Labs, Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1193067 (scfc) p:Triage>Low
[06:40:14] Tool-Labs, Puppet: Develop and publish a gridengine provider for Puppet - https://phabricator.wikimedia.org/T95525#1193114 (scfc) NEW
[06:42:40] Tool-Labs: Error mails from SGE are encoded as application/octet-stream - https://phabricator.wikimedia.org/T63160#1193125 (scfc)
[07:12:19] Results (Found 2): putty, tunnel,
[07:12:19] @search tunnel
[07:12:22] !tunnel
[07:12:22] ssh -f user@bastion.wmflabs.org -L <localport>:<server>:<port> -N Example for sftp "ssh chewbacca@bastion.wmflabs.org -L 6000:bots-1:22 -N" will open bots-1:22 as localhost:6000
[07:17:38] !putyy
[07:17:41] !putty
[07:17:41] official site: http://www.chiark.greenend.org.uk/~sgtatham/putty/ | how to tunnel - http://oldsite.precedence.co.uk/nc/putty.html
[08:20:07] PROBLEM - Free space - all mounts on tools-webgrid-04 is CRITICAL: CRITICAL: tools.tools-webgrid-04.diskspace._var.byte_percentfree.value (<85.71%)
[08:23:00] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1193378 (scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Da...
[08:29:11] YuviPanda: Is there a way we can manually one-time purge some graphite labs data? The integration.* queries are becoming completely useless because we've re-created our instances 4 times now and all the old data stays in there. The per-instance list is fixed by using the wikitech API, but the 'cluster' overview can't do that. And listing instances manually exceeds the query string limit.
[08:30:53] Tool-Labs: Set up a tileserver for OSM in Labs - https://phabricator.wikimedia.org/T62819#1193389 (Nemo_bis) >>! In T62819#988323, @mxn wrote: > The old Toolserver tile server worked for all the Wikimedia languages, not just English, German, and Russian. Are there plans to expand the selection? I had to swit...
[09:29:22] Labs, Wikimedia-Labs-Infrastructure, Continuous-Integration: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193521 (Krinkle) A better example from the new integration-slave-trusty-1010: {F110397} The first boot is fine. Then aft...
[09:29:54] Labs, Wikimedia-Labs-Infrastructure, Continuous-Integration: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193524 (Krinkle)
[09:41:42] Labs, Wikimedia-Labs-Infrastructure, Continuous-Integration: Diamond collected metrics about memory usage inaccurate until third reboot - https://phabricator.wikimedia.org/T91351#1193566 (hashar) Looking at [[ https://wikitech.wikimedia.org/wiki/Atop | atop ]] history, there is nothing suspicious. I...
[09:50:29] Tool-Labs: Error mails from SGE are encoded as application/octet-stream - https://phabricator.wikimedia.org/T63160#1193578 (scfc) I've added: ``` # Debugging to see on which instances mailer is called by whom; T63160. --scfc id -u > $(mktemp /var/tmp/gridengine-mailer-called.XXXXXXX) ``` and deployed the...
[09:59:06] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1193580 (Krenair) NEW
[09:59:39] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1193587 (Krenair) Maybe a cron?
[10:15:42] Labs, Tool-Labs: Delete 'commonsarchivebot' from toollabs - https://phabricator.wikimedia.org/T89807#1193619 (scfc) a:scfc Could you please go through and archive/delete the remaining files of the tool? Thanks!
[10:17:28] Tool-Labs: Webservice restarts should be faster - https://phabricator.wikimedia.org/T85010#1193621 (scfc) Open>Resolved a:coren
[10:17:40] Tool-Labs: Webservice restarts should be faster - https://phabricator.wikimedia.org/T85010#936563 (scfc) Fixed by T95485.
[10:46:46] Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1193662 (scfc) NEW a:scfc
[11:46:36] Does anyone know how up-to-date the database replicas on tools labs are?
[11:46:45] Are they useful for displaying recent changes?
[12:06:34] Labs, Tool-Labs, Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1193775 (Qgil) What is the best way to send an invitation to the hackathon to the Labs / Tools Labs communities? We can plan for activities, but if t...
[12:14:08] Labs, Tool-Labs, Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1193783 (Multichill) >>! In T92274#1193775, @Qgil wrote: > What is the best way to send an invitation to the hackathon to the Labs / Tools Labs commu...
[12:22:20] Labs, Tool-Labs-xTools: Create xtools project on Labs with domain xtools.wmflabs.org - https://phabricator.wikimedia.org/T88123#1193794 (Technical13)
[12:23:15] Tool-Labs, Tool-Labs-xTools: Web services continually restarting - https://phabricator.wikimedia.org/T71934#1193798 (Technical13)
[12:23:40] Tool-Labs, Tool-Labs-xTools: 10-minute load times on toolserver - https://phabricator.wikimedia.org/T76297#1193800 (Technical13)
[12:24:07] Tool-Labs, Tool-Labs-xTools: Tool labs web servers give 503's - https://phabricator.wikimedia.org/T59794#1193802 (Technical13)
[12:24:18] Tool-Labs-xTools, Wikimedia-Bugzilla: Create X!'s tools as component in Tool Labs tools - https://phabricator.wikimedia.org/T68977#1193804 (Technical13)
[12:29:23] Tool-Labs-xTools, Wikimedia-General-or-Unknown: "Web page not available" when clicking "Edit Count" - https://phabricator.wikimedia.org/T55742#1193808 (Technical13)
[12:30:48] Tool-Labs-xTools, Wikimedia-Labs-Infrastructure: Database upgrade MariaDB 10: Lock wait timeouts / deadlocks in a row - https://phabricator.wikimedia.org/T70753#1193811 (Technical13)
[12:31:52] Tool-Labs-xTools: Requesting all pages created creates a 502 Proxy Error - OOM? - https://phabricator.wikimedia.org/T61633#1193815 (Technical13)
[12:49:17] GoldenRing: They are usually a few seconds behind production.
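A quick way to check that ‘few seconds behind’ claim from a tool account - a sketch assuming the standard Tool Labs credentials file and replica host naming:

```
# Compare the newest recentchanges timestamp on the replica with the
# current UTC time; the gap approximates replication lag.
mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p -e \
  "SELECT MAX(rc_timestamp) AS newest_rc, UTC_TIMESTAMP() AS now FROM recentchanges;"
```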
[12:50:01] Labs, Beta-Cluster, operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1193865 (hashar) NEW
[13:12:53] Tool-Labs-tools-Morebots: Add pump.io (identi.ca) support to morebots - https://phabricator.wikimedia.org/T52109#1193935 (Aklapper) p:Normal>Lowest
[13:13:19] Tool-Labs-tools-Morebots: morebots (adminbot) doesn't reliably detect disconnects - https://phabricator.wikimedia.org/T52485#1193939 (Aklapper) p:High>Normal
[13:13:21] Tool-Labs-tools-Morebots: morebots needs a restart loop - https://phabricator.wikimedia.org/T61696#1193938 (Aklapper) p:Triage>Normal
[13:13:25] Tool-Labs-tools-Morebots: morebots depends on parameter "enable_twitter" having been set - https://phabricator.wikimedia.org/T63554#1193940 (Aklapper) p:Triage>Low
[13:14:50] legoktm: If you get a minute, I'd appreciate an extra pair of eyes on the python in https://gerrit.wikimedia.org/r/#/c/199267/
[13:15:04] (manage-snapshots is the python one)
[13:32:24] Coren: Thanks. Also, I don't mean to be picky, but the text table seems to be missing from enwiki_p...
[13:33:35] GoldenRing: The text table isn't actually populated on the production side either. For reasons of scaling, revision text is stored in a completely different way than for small mediawiki installs, and the only way to fetch them from Labs is via the API.
[13:34:18] GoldenRing: The plus side is that because of the heavy caching we do, getting text via the API is actually faster than if you had access to the external storage we use.
[13:34:19] Thanks. I'll stop mucking around with the database and just use the API, then...
[13:41:17] While I'm here, does anyone know if it's possible to use a python3 venv with the uwsgi webservice in tools labs?
[13:42:26] i.e. `virtualenv -p python3.4 ~/www/python/venv`
[13:42:44] then something to start the webservice.
[13:48:00] I'm pretty sure it should be, but I'm not the python expert by a long margin. legoktm is your best bet for this, but many others on the channel might chime in.
[14:09:42] Found this commit: https://github.com/wikimedia/operations-puppet/commit/f780d1161fc2c1eee55b10224c87c3e117cada18
[14:09:54] I guess that means no.
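For context, the setup being attempted - a sketch following the path conventions quoted above; per the linked commit, the stock uwsgi webservice likely stays python2-only, so expect the last step to fail:

```
# Build a python3 virtualenv where the uwsgi webservice convention expects it.
virtualenv -p python3.4 "$HOME/www/python/venv"
"$HOME/www/python/venv/bin/pip" install flask   # whatever the app actually needs
# Then try to start the grid-backed uwsgi webservice against it.
webservice uwsgi-python start
```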
[14:31:01] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:31:15] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[14:31:37] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[14:31:55] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[14:31:55] PROBLEM - Puppet failure on tools-exec-21 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[14:32:05] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0]
[14:32:23] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[14:33:03] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[14:34:59] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]
[14:35:17] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]
[14:35:29] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:36:53] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0]
[14:37:08] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0]
[14:40:34] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[14:42:50] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[14:43:35] I’m looking at ^ but I’m on a slow connection so it’ll take a few minutes for me to get the git history
[14:46:16] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:47:23] PROBLEM - Puppet failure on tools-exec-20 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:47:24] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[14:50:12] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[14:51:28] Sudo::Group[ops] is already declared in file /etc/puppet/manifests/role/labs.pp:12
[14:51:31] cannot redeclare at /etc/puppet/modules/admin/manifests/group.pp:39
[14:51:34] hehe what a mess
[14:52:00] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0]
[15:00:43] failed to bind to LDAP server ldap://ldap-eqiad.wikimedia.org:389: Connect error
[15:00:44] doh
[15:00:47] dns is borked
[15:01:17] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1194373 (coren) So, after some testing, I believe I found an order of operation which doesn't break too many things and allows us to phase out idmap entirely: * make certain...
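For the client half of that phase-out, turning idmap off on an NFSv4 client generally comes down to the following; this is a sketch of the generic mechanism, not the exact (truncated) plan from the task:

```
# Make the NFSv4 client use raw numeric uids/gids instead of
# name@domain idmapping; only safe once uids match on both ends.
echo Y > /sys/module/nfs/parameters/nfs4_disable_idmapping
# Persist the setting across reboots.
echo 'options nfs nfs4_disable_idmapping=Y' > /etc/modprobe.d/nfs-no-idmap.conf
# Existing mounts keep the old behaviour until remounted.
mount -o remount /data/project
```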
[15:06:32] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1194393 (coren) NEW a:coren
[15:08:26] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1194404 (coren) NEW a:coren
[15:09:24] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1194417 (coren) NEW a:coren
[15:09:47] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Do a rolling restart of Tool Labs precise instances - https://phabricator.wikimedia.org/T95557#1194426 (coren) NEW a:coren
[15:11:01] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Remove dependencies on LDAP from labstore100[12] - https://phabricator.wikimedia.org/T95558#1194437 (coren) NEW a:coren
[15:11:51] Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Disable LDAP and enable admin puppet module on labstore100[12] - https://phabricator.wikimedia.org/T95559#1194445 (coren) NEW a:coren
[15:15:04] Labs, Wikimedia-Labs-Infrastructure: Setting hiera role::puppet::self::master prevents instances provisioning - https://phabricator.wikimedia.org/T95560#1194464 (hashar) NEW
[15:15:43] hashar: the puppet issue is resolved now… want to build yourself a new instance and see if that works better?
[15:16:33] Labs, Wikimedia-Labs-Infrastructure: Setting hiera role::puppet::self::master prevents instances provisioning - https://phabricator.wikimedia.org/T95560#1194477 (hashar)
[15:16:45] andrewbogott: I think I have a chicken-and-egg issue when using puppetmaster self
[15:16:58] will trash it and respawn one to verify :)
[15:17:18] hashar: I’m about to fall off the internet, but I will read the bug when I can
[15:17:29] andrewbogott: sure :) have a good bus ride!
[15:17:39] If it involves hiera and/or the ENC then Yuvi may be the one to ask
[15:17:45] yeah I guess
[15:18:09] the instance is prepared with virt1000 which would most probably switch the puppetmaster
[15:18:19] bus ride is almost over but I have more stops before I find reliable wifi
[15:18:21] but resolv.conf might not have been applied yet or some dns is borked. Will see
[15:18:28] I am sure yuvi will figure it out :)
[15:18:32] I am about to leave anyway.
[15:18:41] hashar: oh, so this is a case where you’re setting the puppet config for an instance before the instance is built
[15:18:57] That involves a million new races that we have never debugged
[15:18:59] :(
[15:19:06] Labs, Wikimedia-Labs-Infrastructure: Setting hiera role::puppet::self::master prevents instances provisioning - https://phabricator.wikimedia.org/T95560#1194494 (hashar) + @yuvipanda since he knows about puppetmaster::self / hiera / ENC etc :)
[15:19:12] andrewbogott: I guess :)
[15:20:39] It’s probably not so much a ‘fix this one bug’ thing as a ‘fix everything about labs instance start up’ thing. Probably we’ll need to build new images to get anything reliable to happen
[15:20:39] might be that in the short run we need different base images for old vs. new dns
[15:21:08] * andrewbogott flees
[15:26:23] I *really* hate that the "add member" screen in OpenStackManager doesn't autocomplete
[15:26:49] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:31:24] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:32:24] RECOVERY - Puppet failure on tools-exec-20 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:32:24] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0]
[15:35:17] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:36:11] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:36:52] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:40:21] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0]
[15:40:33] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:42:09] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0]
[15:42:23] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:43:05] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:45:01] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:53:35] Krinkle: re graphite data - sure. File a bug and I'll hand delete all the metrics you want :)
[15:54:31] YuviPanda: I looked at the old archive script (still in puppet), looks like it only deleted instances, not metrics.
[15:54:38] ?
[15:55:16] Metrics are just files on disk. rm -rf kills them
[15:56:36] YuviPanda: I mean, some instances continue to report a '/var' that is at an emergency level
[15:56:38] Labs, Wikimedia-Labs-Infrastructure: Setting hiera role::puppet::self::master prevents instances provisioning - https://phabricator.wikimedia.org/T95560#1194673 (hashar) Open>Resolved a:hashar @andrew told me there was a puppet issue. I have recreated the instance and it works just fine!
[15:56:40] even though that mount no longer exists
[15:56:51] andrewbogott_afk: my instance recreated just fine. Thanks! ( was https://phabricator.wikimedia.org/T95560 )
[15:56:54] but something keeps echoing that metric around
[15:56:58] instead of it dropping to 0/null
[15:57:06] This happens to other things as well
[15:57:09] I don't know what does that
[15:57:29] usually when there is an outage the metric has a gap. In production that happens
[15:57:41] but in labs metrics never have a gap, instead it repeats the last value
[15:57:46] Hi, I have problems writing my CDB files to disk http://drmf-beta.wmflabs.org/ is there any known issue about that?
[15:58:52] my first guess was permission problems but I changed the directory to 777 and www-data owner
[16:04:16] Krinkle: yeah that is txstatsd being terrible
[16:04:39] Which is also why that archiver script had to be killed
[16:04:45] YuviPanda: Interesting.
[16:05:14] YuviPanda: Removing stats of deleted integration.* instances would fix 80% of the annoying data at the moment. That'd be cool if you could do that.
[16:05:27] Yup. Can you file a bug?
[16:05:32] A separate one?
[16:05:33] OK
[16:05:42] I'll get to it in a moment. Just woke up and in bed
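The ‘metrics are just files on disk’ point from above, made concrete - a sketch; the whisper root is the common package default and the instance names are placeholders, not the actual labs layout:

```
# Each graphite metric is a .wsp file; removing a subtree removes its
# metrics from every query. Look before deleting.
cd /var/lib/graphite/whisper
find integration -maxdepth 1 -type d        # what would go away
rm -rf integration/integration-slave1001    # stale instance tree (placeholder name)
```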
[16:05:46] Phone IRC
[16:05:51] cool
[16:08:34] Labs, Continuous-Integration: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194719 (Krinkle) NEW
[16:22:48] that's strange, even mwscript rebuildLocalisationCache.php cannot create the file
[16:23:25] Labs, Beta-Cluster, operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1194794 (Dzahn) http://ubuntuforums.org/showthread.php?t=802156 tldr: bad proxies sudo aptitude -o Acquire::http::No-Cache=True -o Acqui...
[16:25:39] Labs, Continuous-Integration: Purge graphite data for deleted integration instances. - https://phabricator.wikimedia.org/T95569#1194807 (yuvipanda) Should I just delete all the data under the integration project, and let it start again from scratch?
[16:36:32] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[16:37:36] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[16:38:07] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:38:52] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:41:47] !log extdist deleted giant /var/log/extdist from extdist3
[16:41:49] Logged the message, Master
[17:02:54] Eeew. I was hoping there would be fewer things owned by uids not in LDAP
[17:03:15] how bad is it? :)
[17:05:39] By filecount, pretty bad, it's shaping up to be several 100s of thousands of files. But they are very clustered (like, whole trees) and fixing it shouldn't be so hard. For the most part, it looks like system daemons though which may complicate things a bit. I only see three affected projects to date though.
[17:06:09] (I'm only about 30% done walking the tree though)
[17:07:02] wikidata-query has mysql things owned by the apt-managed uid (that's going to be a pain).
[17:07:27] on NFS?
[17:07:37] Wikibugs, grrrit-wm: fix grrrit-wm channels - https://phabricator.wikimedia.org/T95578#1194962 (valhallasw) NEW
[17:07:40] anything on tools?
[17:08:02] account-creation-assistance looks like it has stuff owned by... maybe www-data? Switching it to the puppet-managed apache user shouldn't be too hard.
[17:08:04] Wikibugs, grrrit-wm: fix grrrit-wm bug channels - https://phabricator.wikimedia.org/T95578#1194969 (valhallasw)
[17:08:13] YuviPanda: I haven't reached tools yet.
[17:08:19] aaah, heh
[17:08:32] Coren: www-data comes by default with ubuntu, I think
[17:08:34] and has a set uid
[17:08:39] so that should be kept as is, IMO
[17:08:41] (www-data)
[17:08:53] we need to let uids that come set by ubuntu be set, I think
[17:08:57] Does it come by default or is it managed by apt?
[17:08:59] (so root is fine, for example)
[17:09:25] Because if it comes from apt then the actual uid is dependent on package install ordering.
[17:09:53] Yeah, some may be okay regardless - I'm picking out outliers atm not deciding what to keep or fix. :-)
[17:09:54] Coren: I think it comes with base
[17:10:11] it’s on tools-bastion-01 and that’s never had apache :)
[17:10:13] and it’s also 33
[17:11:58] logstash has a lot of use of uid 110, but it /looks/ like that's all in a backup of some sort.
[17:12:57] I'll know more in a while; this is taking some time to crunch. *lots* of files to stat(). :-)
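The core of the audit Coren describes can be done with GNU find directly - a sketch over an assumed NFS root; -nouser matches files whose numeric owner has no passwd/LDAP entry:

```
# Summarize unowned files by numeric uid so whole mis-owned trees
# (like the mysql and www-data cases above) stand out.
find /srv/project -nouser -printf '%U\n' 2>/dev/null | sort | uniq -c | sort -rn
```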
[17:13:19] heh true
[17:13:50] Coren: wanna take a look at https://gerrit.wikimedia.org/r/#/c/203110/ in the meantime? I’m building in failover to the services from the start
[17:14:56] No problem, looking at it now.
[17:16:24] One of the nice things about service manifests now is that services will not just be kept running but also be started if they aren’t running
[17:16:53] Quarry: Quarry sorts by the first column by default despite an ORDER BY clause - https://phabricator.wikimedia.org/T95369#1195015 (Amire80)
[17:18:42] Quarry: Quarry sorts by the first column by default despite an ORDER BY clause - https://phabricator.wikimedia.org/T95369#1187983 (Amire80)
[17:18:46] Quarry: Quarry does not respect ORDER BY sort order in result set - https://phabricator.wikimedia.org/T87829#1195024 (Amire80)
[17:22:33] Quarry: it would be useful to run the same Quarry query conveniently in several databases - https://phabricator.wikimedia.org/T95582#1195035 (Amire80) NEW
[17:36:00] Labs, Beta-Cluster: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (yuvipanda) NEW
[17:41:33] Labs, Beta-Cluster, Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195176 (greg) p:Triage>Unbreak!
[17:45:38] Why might a pywikibot task get stalled indefinitely? I noticed today that my bot hadn't been active since April 3, and when I checked on labs, the log was full of reports that the job was already running... since April 3. I killed
[17:45:49] it and then the bot started working again.
[17:48:30] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195235 (Dzahn) The error message comes from mediawiki extensions/EventLogging/includes/EventLoggingHooks.php line 33 " wfDebugLog( 'EventLoggi...
[17:49:44] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195246 (Dzahn) So I guess either there should be EventLogging and it needs config, or there shouldn't and then the extension should be removed.
[17:51:50] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195268 (Krenair) I thought MediaWiki was not supposed to be running there anymore?
[18:00:25] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195332 (Dzahn) So the receiving end of it is configured to expect stuff from virt1000, but because mw is gone now these show up on fluorine??
[18:09:49] PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0]
[18:11:16] andrewbogott: puppetmaster::self works just fine on new instances since you got off the bus :)
[18:12:26] hashar: interesting!
[18:14:45] RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:16:13] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195409 (Andrew) I wiped out apache's crontab on virt1000. Did that fix everything?
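On the stalled-bot question from earlier ([17:45:38]), the usual unwedging recipe on the grid is roughly this - a sketch with placeholder tool and job names:

```
become mytool      # switch to the tool account (placeholder name)
qstat              # list the tool's jobs and spot the wedged one
qdel 1234567       # ask gridengine to kill that job id; the next
                   # cron/jsub submission can then start cleanly
```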
[18:25:19] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195481 (Krenair) Open>Resolved a:Krenair Looks like it. Thanks.
[18:25:33] Labs, operations: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195484 (Krenair) a:Krenair>Andrew
[18:28:17] Labs, Continuous-Integration: integration labs project DNS resolver improperly switched to openstack-designate - https://phabricator.wikimedia.org/T95273#1195498 (hashar) The puppet failures were due to the hostname of the puppetmaster changing. That causes puppetmaster self to no longer recognize the maste...
[18:28:41] Labs, Beta-Cluster, Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (hashar) The puppet failures were due to the hostname of the puppetmaster changing. That causes puppetmaster self to no longer recognize the master as being the master...
[19:24:50] Labs, Beta-Cluster, Puppet: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195675 (hashar) Open>Resolved a:hashar Ok solved! That was the exact same issue as on the integration and staging projects. Changing the hostname causes the puppetmaster...
[19:37:21] Hey all, I'd like to install Moses Decoder (http://www.statmt.org/moses/) on a labs instance to explore machine translation options. What do I need to do? Any documentation you can point me to? Thanks.
[19:39:01] bmansurov: That's a very wide question, the answer to which depends a lot on how much system administration you feel comfortable doing yourself and what the actual requirements are.
[19:40:20] Coren: thanks, I'm comfortable with sysadmin. I'm unaware of how things work on labs. I'd like to know if I can create an instance myself, or if I get access to some instance?
[19:41:24] Coren: I've installed the abovementioned software locally, but I can't share my results with my teammates, thus I'd like to install it on labs
[19:42:54] You can create instances if you have a project, or are a member of a project to create them in. I don't think there exists a project where what you want to try is within scope, so you might want to create one. You can request the creation of one by creating a subtask to https://phabricator.wikimedia.org/T76375
[19:43:40] Once your project is created, you'll be able to create instances within it and set them up as you need them.
[19:51:29] Coren: thank you
[19:57:51] Labs, Language-Engineering: Create a Language Engineering labs project - https://phabricator.wikimedia.org/T95609#1195773 (bmansurov) NEW
[19:58:23] Labs, Language-Engineering: Create a Language Engineering labs project - https://phabricator.wikimedia.org/T95609#1195782 (bmansurov)
[19:58:25] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1195781 (bmansurov)
[20:02:11] Labs, Language-Engineering, MediaWiki-extensions-ContentTranslation: Create a Moses translation labs project - https://phabricator.wikimedia.org/T95609#1195801 (Amire80)
[20:10:26] Labs, Beta-Cluster, operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195824 (Dzahn) Open>Resolved a:Dzahn fixed with method 2: ``` # apt-get clean # cd /var/lib/apt # mv lists lists.old # mkdir -p...
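The task comment above is cut off mid-recipe; the usual rest of ‘method 2’ from the linked forum thread goes roughly like this (reconstructed, not quoted from the task):

```
apt-get clean
cd /var/lib/apt
mv lists lists.old
mkdir -p lists/partial   # apt expects the partial/ subdirectory to exist
apt-get clean
apt-get update           # re-fetch the package lists with fresh signatures
```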
[20:11:39] !log deployment fixed apt sources lists on deployment-bastion (T95541)
[20:11:39] deployment is not a valid project.
[20:11:57] !log deployment-prep fixed apt sources lists on deployment-bastion (T95541)
[20:12:00] Logged the message, Master
[20:16:20] Labs, Beta-Cluster, operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195841 (Dzahn) root@deployment-bastion:~# apt-key list | grep -B1 ftpmaster pub 1024D/437D05B5 2004-09-12 uid Ubuntu Ar...
[20:19:07] notices on deployment-prep that a package upgrade would upgrade libc6 and php5 and stuff...
[20:19:16] but it would also downgrade salt
[20:19:19] doesn't run it
[20:20:03] libc6 _and_ php5. hmmm
[20:21:13] Labs, Beta-Cluster, operations: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195853 (hashar) Thanks a ton @dzahn for the fix, the reference and the detailed step by step instructions!
[20:36:47] bmansurov: I note there is already a project named "language" that is described as "Various things for the language engineering team. For instance, MLEB testing, CX server." that seems to be administered by Amire80 inter alia. Might I suggest that this might be a good place for your instance? :-)
[20:37:11] Coren: that sound perfect
[20:37:19] Labs, Language-Engineering, MediaWiki-extensions-ContentTranslation: Create a Moses translation labs project - https://phabricator.wikimedia.org/T95609#1195908 (coren) Would not the 'language' project to which @amire80 is admin do for those tests?
[20:37:21] Coren: should I contact Amire80 to get access to it?
[20:37:39] *s
[20:38:07] bmansurov: Any of the project's admins would do. I think much of the language engineering team is there already. :-)
[20:38:20] Coren: great, thanks!
[20:38:29] (Which, when you think about it, isn't all that odd for a project named 'language') :-)
[20:38:39] ;)
[20:40:02] andrewbogott: It occurs to me that, since we have the global proxy now, it might make sense to have the default IP quota for new projects at 0?
[20:40:40] Isn’t it?
[20:40:59] I thought it was still 1?
[20:41:11] Admittedly, I didn't actually check.
[20:42:23] quota-defaults says 0
[20:43:22] Clearly then what I was saying made sense. :-)
[20:43:29] yep!
[20:44:20] Labs, Tracking: MediaWiki Extension "ImportArticles" project - https://phabricator.wikimedia.org/T89208#1195938 (coren) Open>Resolved a:coren Created a project named 'importarticles' and set @cblair91 as admin.
[20:44:22] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1195941 (coren)
[21:11:08] YuviPanda: would you be able to grant me the requisite access and sudo privs to achieve https://wikitech.wikimedia.org/wiki/EventLogging/Testing/BetaLabs ? i'm trying to verify event logged stuff from the beta cluster
[21:12:16] dr0ptp4kt_cold: yes but I'm not near my computer now. Lots of other people can too, however. We have plenty of deployment prep admins
[21:12:25] Including jdlrobson
[21:12:54] YuviPanda: cool, i'll see if he can give a hand
[21:13:09] :)
[21:27:52] Labs, operations, ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1196095 (Andrew) NEW a:Cmjohnson
[21:47:22] Labs, Language-Engineering, MediaWiki-extensions-ContentTranslation: Create a Moses translation labs project - https://phabricator.wikimedia.org/T95609#1196137 (Amire80) Open>Resolved a:Amire80 OK, for now we created an instance in the existing project.
[21:47:25] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1196140 (Amire80)
[21:49:50] PROBLEM - Host tools-services-02 is DOWN: CRITICAL - Host Unreachable (10.68.18.35)
[21:50:33] don’t worry!
[21:50:35] that’s me!
[21:55:26] RECOVERY - Host tools-services-02 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[22:10:48] PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[22:14:57] Labs, Tool-Labs, Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1196231 (scfc) The bastion hosts run Ubuntu Trusty and setting motd on those is currently blocked by T85307. I would suggest sending an invitation t...
[22:15:44] Labs, Tool-Labs, Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1196233 (scfc) … or rather `labs-announce`.
[22:15:46] RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:28:24] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1196290 (RobH) So, I think that one of the old lsearch servers would work for this: wmf3152 Dell PowerEdge R410, Dual Intel Xeon X5650 (2.66 GHz), 48GB Memory, (2) 150GB Dis...
[22:28:41] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1196293 (RobH) a:RobH>Cmjohnson Advise, and assign back to me, thanks!
[23:01:12] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1196377 (scfc)
[23:01:15] Labs, Language-Engineering, MediaWiki-extensions-ContentTranslation: Create a Moses translation labs project - https://phabricator.wikimedia.org/T95609#1196375 (scfc) Resolved>Invalid Then let's not mark it as "resolved" :-).
[23:19:30] Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196413 (scfc)
[23:19:39] PROBLEM - Host tools-webgrid-04 is DOWN: CRITICAL - Host Unreachable (10.68.17.174)
[23:29:15] YuviPanda: is tools-webgrid-04 you?
[23:29:40] oh, nm, I just had to read backscroll
[23:54:25] PROBLEM - Puppet failure on tools-webgrid-08 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:56:14] [nagf] Krinkle pushed 1 new commit to master: https://github.com/wikimedia/nagf/commit/2ffbd6f11a78716e3fc2d4533b4e892489c7477d
[23:56:14] nagf/master 2ffbd6f Timo Tijhof: Update graphite properties following rename at Wikimedia
[23:57:46] wikimedia/nagf#25 (master - 2ffbd6f: Timo Tijhof) The build passed. - http://travis-ci.org/wikimedia/nagf/builds/57885373