[00:00:26] yeah, no one could get in [00:00:42] Coren or andrewbogott are the only hope now, I think [00:01:16] RoanKattouw: I can look, give me 10 or so [00:13:40] RoanKattouw: I have a root login on deployment-parsoidcache01… anything in particular you'd like me to look for? [00:14:21] andrewbogott: Well first of all, why can't I ssh into it from deployment-bastion [00:14:57] Second, why is it refusing connections on port 80? Is Varnish not running or something? [00:15:34] Just now I tried connecting to 80 (conn refused), SSH (auth error) and ping (worked) [00:18:24] mind if I reboot it? [00:18:40] puppet is failing in an unfamiliar way. [00:18:47] Warning: Unable to fetch my node definition, but the agent run will continue: [00:19:38] RoanKattouw: ? [00:19:43] It is using puppetmaster::self [00:19:47] But that shouldn't have magically broken [00:23:17] It's probably OK for it to not use puppetmaster:self I suppose? [00:23:21] I'll try that [00:23:59] "The last Puppet run was at Sat Jun 21 16:17:26 UTC 2014" [00:24:01] No, don't change that! [00:24:05] that will break the instance for good [00:24:06] OK [00:24:09] * RoanKattouw stops touching thing [00:24:11] s [00:24:14] I haven't touched it yet [00:24:18] 'k [00:24:22] OK well if the last puppet run was in June, that's probably bad? [00:24:25] I rebooted, we'll see if it's calmer now [00:24:31] yeah, that means it's been broken for ages. [00:33:07] huh [00:33:31] hi roan [00:34:34] I just assumed it was part of deployment-prep because it was part of deployment-prep [00:35:03] I rebooted it before andrewbogott did. so idk where June came feom [00:35:07] feom* [00:35:17] from* damnit [00:36:03] just 'cause it rebooted doesn't mean that puppet ran [00:36:14] right [00:36:48] don't remember how far I got before I left [00:37:07] anyway, how did it get into deployment-prep? [00:37:51] I don't understand your question [00:38:11] why wouldn't it be in deployment-prep? 
[00:38:19] I have to go in a moment, but here's what the failure looks like: https://dpaste.de/mtvH [00:38:37] The master says that the cert has expired, but there is no cert to expire. I've cleared certs on the host and the master. [00:39:45] nvm, I misread something [00:40:49] I accidentally revoked all on that master today. didn't fix parsoidcache because I couldn't log in [00:41:09] which is why I booted it [00:42:55] oh [00:43:01] well, since I can log in, what would you suggest I do? [00:43:08] (Although clearly it didn't work, even pre-revoke) [01:03:50] RoanKattouw_away: now in _security you're talking about a different box and I'm confused [01:16:43] Well, I have to go. Something is broken with ssl between that box and deployment-salt, but I don't know what it is… it isn't getting as far as logging anything on the puppet master. [01:32:51] andrewbogott_afk: Sorry, I was telling Ori about one box while simultaneously pointing out that the other box in front of it was broken [01:33:27] andrewbogott_afk: Sorry for the confusion, and sorry I wasn't there, I had to run an errand [01:34:23] I'll try to see if I can spin up a different VM in its place, which Should Be Easy (TM). If it turns out not to be easy, I'll just tell people that it's broken because of OpenStack mysteries [02:00:16] andrewbogott_afk: OK well that went nowhere, because I can't even CREATE instances. It just says "Failed to create instance." [02:02:19] 3Wikimedia Labs / 3wikistats: Add sourceforge farm - 10https://bugzilla.wikimedia.org/58396#c1 (10Daniel Zahn) some use "/wiki/" and some use "/mediawiki/"? really? [02:02:27] andrewbogott_afk: I tried to delete the broken instance to make room, but that didn't help either. Sigh. 
[02:14:29] 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster api.php, index.php, load.php return 404 (caused failed browser tests) - 10https://bugzilla.wikimedia.org/70648#c3 (10spage) 5NEW>3RESO/WOR I think beta labs is working now (thanks to the work of jeremyb and others), though I think this warrants a po... [02:21:23] RoanKattouw: You probably guessed it, but "Failed to create instance" usually means out of quota. [02:21:59] scfc_de: Yeah so I removed the broken instance to try to make room for it, and I got the same error again [02:22:20] So either that project is still over quota (how?!) or something else is going on [02:26:33] RoanKattouw: You should be able to see the current quota at https://wikitech.wikimedia.org/wiki/Special:NovaProject => "Display Quotas". [02:27:31] RoanKattouw: your quota is 41 instances [02:27:36] in deployment-prep [02:28:18] You also have to look at CPUs & stuff. [02:29:36] !log deployment-prep raised instance quota by 1 to 42 [02:29:39] Logged the message, Master [02:31:34] Yay now it let me create the instance [03:00:59] 3Wikimedia Labs / 3wikistats: Add sourceforge farm - 10https://bugzilla.wikimedia.org/58396#c6 (10Daniel Zahn) actually, _after_ adding the code above i now see there is no Mediawiki in use anymore? do you still actually see any Mediawiki api.php link on sourceforge? sigh? [03:32:44] 3Wikimedia Labs / 3wikistats: New Hive - 10https://bugzilla.wikimedia.org/70309#c5 (10Daniel Zahn) 5PATC>3RESO/FIX http://wikistats.wmflabs.org/display.php?t=or [04:20:44] 3Wikimedia Labs / 3wikistats: Add sourceforge farm - 10https://bugzilla.wikimedia.org/58396#c7 (10Nemo) (In reply to Daniel Zahn from comment #6) > do you still actually see any Mediawiki api.php link on > sourceforge? The api.php URLs with own subdomains still work. A quick search didn't find anything for... 
[10:07:58] 3Wikimedia Labs / 3deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - 10https://bugzilla.wikimedia.org/70141#c9 (10Yuvi Panda) There now exists monitoring for puppet failures and disk space (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=la... [10:13:44] 3Wikimedia Labs / 3deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - 10https://bugzilla.wikimedia.org/70141#c10 (10Yuvi Panda) Note that the alert are for all the machines, in betalabs, not just for the ones listed. I added more features to our check_graphite s... [11:23:45] 3Wikimedia Labs / 3deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - 10https://bugzilla.wikimedia.org/70141#c13 (10Yuvi Panda) Also, who is responsible for fixing the errors that pop up? There are puppet failures on videoscaler-01 now, and I've no idea how to f... [11:27:17] 3Wikimedia Labs: remove puppetmaster class and variables from global puppet groups - 10https://bugzilla.wikimedia.org/70708 (10Filippo Giunchedi) 3NEW p:3Unprio s:3trivia a:3None looks like it can live in project-specific groups and makes it more confusing with role::puppet::self vs puppetmaster on whi... [11:30:33] hi [11:31:45] 3Wikimedia Labs / 3deployment-prep (beta): Determine first pass list of icinga-alerting data from graphite.wmflabs - 10https://bugzilla.wikimedia.org/70141#c14 (10Yuvi Panda) (On that note, I'd also remove myself from the alert groups once the initial setting up is stabilized) [11:32:52] I would like to create an other shell account name (the current name does not satisfy me), but I already have a shell account. Could I recreate an other access request without any trouble? [12:15:38] Continuing from my yesterday queries : ). I wanted to test my exim/puppet patch https://gerrit.wikimedia.org/r/#/c/155753/ in labs. 
I created a new instance and enabled the role role::puppet::self. I gave a sudo puppetmaster start, and later did a sudo puppet agent -tv and applied the patch ( git fetch .. && git checkout ) in /var/lib/git/operations/puppet and did a sudo puppet agent -tv expecting the change to be applied to my exim. can someone confirm that the steps sound good for applying a puppet patch ? [12:19:51] Coren: what do you think about a new component in bugzilla for labs replication database problem? using the infrastructure component pings ryan and tim which are the wrong people in this case and sean is missing per default [12:21:23] i am currently writing a report for replication error at dewiki db th [12:21:44] Merlissimo: Sounds sane. [12:30:29] 3Tool Labs tools / 3[other]: merl tools (tracking) - 10https://bugzilla.wikimedia.org/67556 (10merl) [12:30:30] 3Wikimedia Labs / 3(other): (Tracking) Database replication services - 10https://bugzilla.wikimedia.org/48930 (10merl) [12:30:32] 3Wikimedia Labs / 3Infrastructure: missing database entries at categorylinks table on dewiki db - 10https://bugzilla.wikimedia.org/70711 (10merl) 3NEW p:3Unprio s:3major a:3None My bot searches for articles without categories on dewiki. But the database returns wrong results because of missing entrie... [13:04:14] 3Wikimedia Labs / 3deployment-prep (beta): Setup monitoring for Beta cluster - 10https://bugzilla.wikimedia.org/51497#c7 (10Yuvi Panda) There's now alerts for the following things for betalabs: - Low space on /var - Low space on / - Puppet staleness (warn at 1h, crit at 12h) - Puppet failure events Note th... [13:44:59] YuviPanda: how will I enable a class role::mail::mx on my instance ? mark thinks that that should be the problem our patch was not getting applied. ie we were not having a mailserver role applied on the instance. [13:45:23] do we have something in the Special:::NovaInstance&action=configure to get that role enabled ?
[13:45:40] tonythomas: simplest way is to do 'include role::mail::mx' under node default { in manifests/site.pp [13:46:08] YuviPanda: let me try that. [13:46:52] tonythomas: or whatever the class name of the role you want [13:47:23] I think I never pasted that under node default yesterday ! [13:47:46] a [13:47:47] h [13:56:36] YuviPanda: this is what I get, it seems again Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Class[Exim4] is already declared in file /etc/puppet/manifests/mail.pp:65; cannot redeclare at /etc/puppet/manifests/role/mail.pp:5 on node mediawiki-verp.eqiad.wmflabs [14:11:29] tonythomas: that sounds like a bug in the code, or us using the wrong role [14:11:41] tonythomas: need to poke whoever knows about our mail servers to see what's wrong [14:13:14] YuviPanda: looks like. hope it gets fixed soon [14:14:31] ok [14:14:42] * Coren reads scrollback. [14:15:51] tonythomas: You're trying to include the same class twice, in two different places. Once in mail.pp and the other one in role/mail.pp. Is either of those yours? [14:19:53] Coren: you've been added to the toollabs alert group, btw :) [14:20:03] scfc_de: ^ I should add you as well, I think. [14:24:26] The standard labs role includes at least one mail class. [14:25:10] YuviPanda: That's what I was about to ask. Does Icinga use the existing LDAP user (how?), or do I need to add an entry to modules/admin/data/data.yaml (without any groups, of course)? [14:25:44] scfc_de: need to be added separately (email/uname) in the private repo. I'm unsure how since I don't have access to the private repo [14:26:17] YuviPanda: Okay, then I leave it all to you :-). [14:26:31] scfc_de: :) I suppose you'll prefer alerts in the same email you use on gerrit? [14:26:32] (Because I don't even know there is a private repo :-).) [14:26:38] andrewbogott: do you know how to add people to the private repo?
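[Editor's aside: the "Duplicate declaration: Class[Exim4]" failure pasted above is the classic Puppet error of one class being declared from two places. A minimal hypothetical sketch — the node block follows YuviPanda's suggestion earlier in the log; the exact contents of mail.pp and the role classes are assumed, not quoted:]

```puppet
# /etc/puppet/manifests/site.pp (hypothetical sketch)
node default {
    include standard        # on labs this pulls in role::mail::sender,
                            # which reaches a declaration of class exim4
                            # via manifests/mail.pp
    include role::mail::mx  # the role under test also declares exim4
                            # via manifests/role/mail.pp
}

# Puppet allows a class to be *included* any number of times, but
# *declared* resource-style (class { 'exim4': ... }) only once per
# catalog.  Reaching two resource-style declarations of exim4 -- one
# through mail.pp, one through role/mail.pp -- yields exactly the
# "Duplicate declaration" error shown above.
```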
[14:26:48] YuviPanda: I use tim@tim-landscheidt.de for everything. [14:30:36] tonythomas: That's probably it (what andrewbogott said); you're conflicting with the default mail class all labs instances get. You'll have to remove that one for your tests. [14:34:40] ok, scfc_de, does this look ok? https://dpaste.de/f7wB [14:37:26] andrewbogott: Yes; but I use "scfc_de" only on IRC. My shell name is just "scfc". I don't know what's the convention there. [14:37:52] Mostly I'm trying to get enough info in there that someone will know who you are if they read. [14:38:06] good you have IRC nick in there someplace :) [14:38:39] scfc_de: andrewbogott I submitted https://gerrit.wikimedia.org/r/#/c/159731/ [14:40:00] scfc_de: ok, this is all merged. Prepare to regret! [14:40:36] andrewbogott: I suppose we should just wait for icinga to pick up the slew of changes... [14:40:40] I wonder why it takes a while [14:40:51] YuviPanda: I think that puppet has to run on the client, and then on the master. [14:40:56] ah [14:40:59] let me force one on labmon [14:41:00] Or maybe even on the master and then client and then master again? Anyway it takes a while :) [14:41:01] andrewbogott: I have an efficient mail filter :-). [14:41:14] 3Wikimedia Labs / 3Infrastructure: List of SVN users who were not migrated - 10https://bugzilla.wikimedia.org/58687#c7 (10Nemo) (In reply to Tim Landscheidt from comment #6) > Nemo, could you rephrase what list you are looking for, or close this bug? A list of wikimedia SVN users, minus those which have an... [14:42:59] andrewbogott: can you force one on neon? I just forced one on labmon [14:43:06] yep, ahead of you [14:43:24] heh [14:43:53] YuviPanda: "scf*c*_de" :-). [14:44:22] scfc_de: gosh, I've always thought / read it as scfe_de :| [15:01:51] scfc_de: can you help with monitoring OGE? I'm trying to figure out what things to collect so I can monitor [15:03:00] YuviPanda: In Ganglia days, we graphed jobs running/queued/in error. 
I would set an alert for ${in error} > 0. [15:03:17] scfc_de: how do I monitor running / queued? [15:03:21] qstat what params? [15:03:31] the default diamond collector checks qstat -g c [15:03:46] I don't actually know what the output means [15:04:16] YuviPanda: operations/puppet:modules/gridengine/files/grid-ganglia-report [15:04:32] ah [15:04:33] interesting [15:04:35] * YuviPanda reads [15:05:33] Load would be interesting as well, *but* that should probably be monitored for all instances. [15:05:43] scfc_de: yeah, there's a loadavg plugin we can enable [15:06:08] scfc_de: I dunno if that normalizes to per core already or not [15:06:46] loadavg plugin = diamond plugin for grid or for all instances? [15:06:52] all instances [15:07:10] scfc_de: https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/loadavg/loadavg.py [15:09:02] scfc_de: I suppose it's not *that* useful without per core normalization [15:14:57] scfc_de: I did a manual check, don't think new instances are needed right now. Load seems ok. [15:16:01] Load is /such/ a sucky metric. You want 100%-idle CPU and IOWait. [15:25:35] Coren: hmm, we're already capturing iowait [15:25:44] Coren: I suppose I can setup alerts based on that [15:25:46] Is good. :-) [15:33:18] YuviPanda: The problem with load (or 100%-idle CPU and IOWait) isn't that it is too large at /this/ moment :-). We had tools-gift* running amok in the past, and that only happened once a month or so. tools-login is very unpredictable as well. (As Puppet will fail then, there is already an alert for this :-), but something explicit would be nice.) [15:33:33] yeah [15:33:50] I wonder what metric we should use for CPU usage [15:35:29] scfc_de: I suppose iowait, user, system > 70% for 5mins out of last 10min? [15:36:42] Any alarm should require some integrated average, not instantaneous value. 
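[Editor's aside on the per-core question above: assuming the diamond loadavg collector reports the raw /proc/loadavg values, as the discussion suggests, normalization has to happen on the consuming side. A minimal sketch; the function names are invented for illustration and are not diamond's API:]

```python
import os


def normalized_load(loadavg_1min, cpu_count=None):
    """Return the 1-minute load average divided by the core count.

    A normalized load of 1.0 means the box is exactly saturated;
    anything above 1.0 means runnable tasks are waiting for a CPU.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return loadavg_1min / cpu_count


def read_loadavg():
    """Read the 1-minute load average from /proc/loadavg (Linux only)."""
    with open('/proc/loadavg') as f:
        return float(f.read().split()[0])
```

This is why raw loadavg makes a poor alert threshold across a mixed fleet: a 16-core grid node at load 8 is half-busy, while a 2-core instance at load 8 is badly overloaded.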
[15:36:56] (Well, except for simple pass/fail tests, obviously) [15:37:05] Coren: indeed, and the graphite checks support that [15:37:14] 3Wikimedia Labs / 3deployment-prep (beta): Setup monitoring for Beta cluster - 10https://bugzilla.wikimedia.org/51497 (10Željko Filipin) [15:37:18] so it would be over 70% for 5 of the last 10 checks [15:37:23] Coren: remove the default mail class ? that would be like removing the inclusion from sites.pp ? [15:37:27] (where each 'check' is sent by diamond once in a minute) [15:38:57] tonythomas: Well, you need to figure out where it's included /from/. I'm guessing what is biting you right now is role::labs::instance, where role::mail::sender is included. Removing that (or commenting it out) might be all you need. [15:40:39] yup. the default have a include role::mail::sender in standard class and an import mail.pp in site.pp [15:41:30] tonythomas: Removing those should work for your testing. Obviously, in "real life", you'd modify the mail class(es) instead. [15:42:12] the import too can be removed right ? [15:43:10] I expect so. [15:48:45] scfc_de: did you get alerts? :) [15:49:35] uh, that's weird [15:52:26] scfc_de: Coren 800M /var/spool in tools-master [15:52:28] unsure what's causing that [15:53:02] YuviPanda: Yep, got the alerts. [15:55:00] scfc_de: 770M messages in /var/spool/gridengine/ [15:55:15] I suppose that can be deleted [15:55:47] Coren: on removing the import 'mail.pp' and adding my new role::mx under default node, I get a new error Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Could not find declared class exim::roled at /etc/puppet/manifests/role/mail.pp:60 on node mediawiki-verp.eqiad.wmflabs [15:56:06] so the import 'mail.pp' must be there somehow ?
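[Editor's aside: the "over 70% for 5 of the last 10 checks" rule discussed above is a windowed-threshold test over per-minute datapoints. A minimal sketch of that logic; this is an illustration, not the actual check_graphite implementation:]

```python
def should_alert(samples, threshold=70.0, min_breaches=5, window=10):
    """Alert if at least `min_breaches` of the last `window` samples
    exceed `threshold` -- the "5 of the last 10 checks" rule above.

    `samples` are per-minute datapoints, oldest first, as diamond
    would have submitted them to graphite.
    """
    recent = samples[-window:]
    breaches = sum(1 for value in recent if value > threshold)
    return breaches >= min_breaches
```

Requiring several breaches in the window is what Coren asked for: an integrated view rather than an alarm on a single instantaneous spike.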
[15:56:09] yeah [15:56:13] tonythomas: you shouldn't remove the import [15:56:15] just the include [15:57:49] Coren: ^ [15:57:51] err [15:57:58] for the messages file on tools-master [16:02:48] YuviPanda: That file seems to be rotated (purged?) every month? So no problem? [16:03:00] (Line 1 = "08/25/2014 22:07:08".) [16:03:05] scfc_de: it should be in /var/log tho, since it looks like a log [16:03:07] right [16:03:23] Upstream :-). [16:03:28] Coren: YuviPanda this is the diff of the change I did, and still yet I get a duplicate declaration error https://dpaste.de/FjZC [16:03:42] I think there is someother config thats including the mail::sender somehow [16:03:53] as its still listed in states/classes.txt [16:04:01] I have a mail::sender there [16:04:09] scfc_de: heh [16:15:16] YuviPanda: Could you acknowledge/silence the alarm for Puppet failures on -exec-12? (I think Icinga has a button for that.) [16:15:48] scfc_de: you should have icinga access too? [16:21:21] YuviPanda: Oh, I need to get that under_nda thing RSN. [16:21:29] RSN? [16:22:22] scfc_de: I've acknowledged it till 18 Sep [16:22:42] RSN = real soon now. [16:25:15] ah [16:25:16] heh [16:25:21] scfc_de: you should poke mutante [16:31:42] !log deployment-prep Delete deployment-graphite instance [16:31:44] Logged the message, Master [16:33:59] 3Wikimedia Labs / 3Infrastructure: List of SVN users who were not migrated - 10https://bugzilla.wikimedia.org/58687#c8 (10Tim Landscheidt) I assume that *all* SVN users have been migrated to LDAP/Gerrit/wikitech/Labs (causing such errors as bug #61967), so I don't see a way to differentiate between an accoun... [16:47:44] YuviPanda: when a lab instance do puppet agent -tv, the node default { } is called ? [16:47:58] kindaaaaa [16:48:04] puppet is a declarative language [16:48:08] so 'called' isn't really accurate [16:48:18] yeah. included ? 
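[Editor's aside: the "keep the import, remove only the include" advice above comes down to the pre-Puppet-4 distinction between loading class definitions and applying them. A hypothetical site.pp fragment; the placement of the commented-out include is illustrative, not a quote of the real manifests:]

```puppet
# /etc/puppet/manifests/site.pp (hypothetical fragment)

import 'mail.pp'                  # keep: makes the classes defined in
                                  # mail.pp (e.g. exim::roled) known to
                                  # the parser; removing it caused the
                                  # "Could not find declared class" error
                                  # seen earlier

node default {
    # include role::mail::sender  # commented out for testing: applying
                                  # the default sender class collides
                                  # with the mail role under test
    include role::mail::mx
}
```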
[16:48:30] tonythomas: I suggest going through https://docs.puppetlabs.com/learning/ for a bit, shouldn't be too long [16:48:35] yeah, along with other things [16:51:30] YuviPanda: that looks a really long read though. have been there once. [16:51:36] *had [16:52:00] considering the complexity of puppet, reasonable though [16:52:11] true [16:53:10] the starting point is only site.pp right ? [16:54:48] and the funny things -> inside node default { }, I see only a realm switch, which have two includes if 'production' and nothing, if the realm is labs :\ [17:05:52] tonythomas: There's some magic that pulls classes to be applied to an instance from the instance's LDAP entry IIRC. I think the mail class collision was topic on the bug with which andrewbogott moved the simple-sender bit to role::labs::instance, and AFAICR we concluded that we might need to revisit that if it leads to something unsurmountable. [17:20:24] scfc_de: ok. in that case. I 'grep -R "mail::sender" *' and see it included in role/labs.pp too. might be some kindof inclusion from there ? [17:24:42] tonythomas: (BTW, you can use "git grep" in a Git repo -- much cooler, faster, hyper! :-)) Didn't you remove the include from there? (I followed your work only from the sidelines.) [17:42:19] scfc_de: CPU monitoring in place \o/ [17:42:32] scfc_de: Coren I'm wondering how to setup monitoring for OOM situations [17:43:24] You can't monitor for /OOM/ per say, save by parsing the syslog to see the oom killer in action. You could monitor for /low/ memory though. [17:43:33] per se* [17:48:39] Coren: yeah, low memory [17:48:40] MemFree? [17:48:50] I am not fully sure how VMEM/MemFree etc interact tho [17:49:07] YuviPanda: You only care about real memory; not VMEM [17:49:13] hmmm [17:50:06] Coren: MemFree? http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410457795.695&target=tools.tools-dev.memory.MemFree.value [17:51:02] YuviPanda: That seems to not take buffers and cache into account. 
[17:51:07] scfc_de: exec-12 only fails on vips now [17:54:53] a930913: Why do you have a python running in a detached screen session? [17:57:43] Coren: Screen or mosh? :p [17:58:01] I shouldn't have a screen on login... [17:58:02] screen. It dated from March, and was idle for months. I killed it. [17:58:42] Coren: :o March. [17:58:49] Coren: Know what the script was? [17:58:55] The python I mean. [17:59:02] There wasn't. Just 'python' [17:59:15] Coren: yeah, unsure which metric takes those into account [17:59:33] But if it /was/ running something, then I'd have been actually annoyed. :-) [17:59:35] Must have been testing something :o [17:59:37] perhaps sum buffers, cache, active, subtract from total? [18:00:25] YuviPanda: memfree+buffers+cache gives actually available mem. If you alerted on memfree, we'd be in trouble. :-) [18:00:31] Coren: The only reason I'd be running a screen on login, is because I would be interacting over an unreliable connection. [18:00:44] hehe [18:00:58] Coren: not really sure how to check for 'low memory' tho [18:01:30] memfree+buffers+cache < some value is "low memory" [18:01:35] hmm [18:01:38] ok [18:01:56] 256MB? [18:02:00] 128? [18:02:43] Though if you wanted to be really, really safe you'd actually want something like (memfree+buffers+cache-swapused < 0) as this better reports the core working set. [18:03:24] (a.k.a "How much ram would be left if every process woke up at once) [18:03:57] Anything < 0 means the box would trash swap. [18:43:56] scfc_de: how did we resolve the vips issue? [18:44:01] scfc_de: did we upload the package to apt.? [18:44:12] scfc_de: or did we just put in labsdebrepo? [18:44:18] if latter, can we do that for trusty as well? [18:47:14] dump [19:03:03] scfc_de: remove the include from labs.pp ? no. that too is necessary ? [19:39:53] Beta Labs is dead [19:39:59] StevenW: Link? [19:40:16] <^d> http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [19:40:20] wfm. 
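[Editor's aside, back on the low-memory thread: Coren's rule above (MemFree + Buffers + Cached, minus swap in use) can be computed from /proc/meminfo. A minimal sketch; the function names and the alert threshold are invented for illustration:]

```python
def available_kb(meminfo):
    """Coren's rule from the discussion above: memory actually available
    is MemFree + Buffers + Cached.  Subtracting swap in use estimates
    the headroom left if every process' working set came back to RAM;
    a negative result means the box would thrash swap under load.
    All values are in kB, as /proc/meminfo reports them.
    """
    swap_used = meminfo['SwapTotal'] - meminfo['SwapFree']
    return (meminfo['MemFree'] + meminfo['Buffers']
            + meminfo['Cached'] - swap_used)


def read_meminfo(path='/proc/meminfo'):
    """Parse /proc/meminfo into a {field: kB} dict (Linux only)."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])  # first token is the kB count
    return info
```

An alert would then fire on something like `available_kb(read_meminfo()) < 256 * 1024`, rather than on MemFree alone, which (as noted above) ignores buffers and cache.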
[19:40:29] <^d> Unexpected non-MediaWiki exception encountered, of type "BadMethodCallException" [19:40:30] <^d> [4bf90380] /wiki/Main_Page Exception from line 164 of /srv/mediawiki/php-master/extensions/Wikidata/vendor/wikibase/data-model/src/Entity/Item.php: Call to a member function hasLinkWithSiteId() on a non-object (NULL) [19:41:00] Beta enwiki is down, then. Betacommons seems to be working. [19:41:02] https://www.irccloud.com/pastebin/XK19grD4 [19:41:09] <^d> mashing f5 doesn't make it go away. [19:41:11] <^d> :p [19:41:12] heh wikidata again [19:41:37] marktraceur: I get https://dpaste.de/BDQr when visiting http://en.wikipedia.beta.wmflabs.org/ [19:41:37] Yeah that's it. [19:43:25] * Damianz_ f5's chad and gets dahc [19:57:58] StevenW: me too. wtf [19:58:29] scfc_de: all icinga toollabs alert greeeeen! :D [19:58:29] aude was looking into it. [19:58:49] scfc_de: we're also already meausring loadavg, but I don't think we should alert based on it [20:00:20] 3Wikimedia Labs / 3deployment-prep (beta): beta labs down with Wikidata error - 10https://bugzilla.wikimedia.org/70740 (10Chris McMahon) 3NEW p:3Unprio s:3critic a:3None Visit http://en.wikipedia.beta.wmflabs.org/. Get error: Unexpected non-MediaWiki exception encountered, of type "BadMethodCallExce... [20:00:58] * aude hides [20:01:05] StevenW: ^d I BZ'd it ^^ [20:01:20] reverting, jenkins gate and submit [20:01:48] aude: was that you then? [20:02:54] yes [20:03:02] baffled why test.wikidata is fine and beta is not [20:03:14] if test.wikidata / test2 breaks then we rever there also [20:07:58] <^d> Browsertests from jenkins to beta are getting backed up. [20:08:51] <^d> Also, jenkins has a bug where if you collapse the build queue it won't reopen. [20:10:38] lovely, jenkins say no to revert [20:10:48] <^d> no revert 4 u [20:11:15] ok, it's the test that is broken [20:12:09] overruled! 
[20:19:44] 3Wikimedia Labs / 3deployment-prep (beta): beta labs down with Wikidata error - 10https://bugzilla.wikimedia.org/70740#c1 (10Chris McMahon) 5NEW>3RESO/FIX aude reverted the bad commit [20:20:31] i think the problem is stale items in memcached [20:20:47] i'll try tomorrow again and bump the cache key [20:21:06] and maybe make the code more robust to handle this [20:21:10] probably* [20:21:40] we always bump cache key on test.wikidata etc, so no problem there [20:25:21] $wgCacheEpoch = wfTimestampNow(); [20:26:37] legoktm: not the same cache and we don't want to invalidate all the time [20:26:48] only whe pushing new code, so maybe with the git hash [20:27:01] well, I was joking :P [20:27:05] hah :D [21:51:14] 3Wikimedia Labs / 3deployment-prep (beta): Setup a mediawiki03 (or what not) on Beta Cluster that we can direct the security scanning work to - 10https://bugzilla.wikimedia.org/70181#c11 (10Dan Duvall) I've cherry-picked the patch to deployment-salt.eqiad.wmflabs and the last scap deployment seems to have sy... [23:01:33] there you go, icinga-wm [23:01:35] YuviPanda: [23:02:39] andrewbogott: oooh, i see what you mean, there are a lot more new checks on labmon now [23:02:45] as opposed to like yesterday [23:04:19] CUSTOM - ToolLabs: Excess CPU check: iowait on labmon1001 is OK: OK: All targets OK [23:04:22] ^ test, works [23:04:43] Coren: ^ tool labs monitoring in here now [23:09:59] you guys rawk [23:12:36] :) [23:54:10] Yeay! Monitoring! [23:54:42] Of course, since nothing /ever/ goes wrong it won't make much of a difference. :-) [23:56:07] Coren: hahahaha. if beta is stable for 48 hours I do a little dance [23:56:29] CUSTOM - yea, right, haha - OK [23:57:07] chrismcmahon: That might happen if the devs stopped messing with it. [23:58:21] as long as it's more stable than prod today [23:58:33] Coren: we're talking seriously about multiple beta-like envs. I wrote a lovely essay in Phabricator about why, and if fab were up I'd point you to it. 
[23:59:47] it won't be up