[00:00:14] RECOVERY - Puppet failure on tools-webgrid-generic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:00:28] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [00:01:50] PROBLEM - Puppet failure on tools-exec-20 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [00:03:36] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:07:48] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:08:54] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [00:11:41] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [00:13:13] RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0] [00:23:20] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:31:50] RECOVERY - Puppet failure on tools-exec-20 is OK: OK: Less than 1.00% above the threshold [0.0] [00:47:00] OK, why is my tool failing to start? [00:48:17] Magog_the_Ogre: hi. which one is this? [00:50:55] nevermind, it started this time [00:50:58] it said Timeout: could not stop job in 30s [00:51:03] so I thought it was a startup error [00:51:34] ah, ok :) [01:47:41] wikibugs stopped responding again [02:10:19] Hi folks. I'm a new developer for a labs project, and I am wondering about storage. My VM includes 160GB of storage, but it's unclear where this is mapped in the filesystem. Does anybody know? [02:11:36] hi Shilad [02:11:52] hi! [02:12:08] Shilad: so by default it is unallocated, and you can allocate it by enabling the role role::labs::lvm::srv [02:12:29] Shilad: you can do so by going to wikitech.wikimedia.org/wiki/Special:NovaInstance, selecting your instance, clicking ‘configure’, and ticking the checkbox next to that one [02:12:44] Shilad: and then you run puppet (or wait 20 minutes) and you will have a /srv partition with the rest of your stuff [02:12:52] Awesome. Thanks. [02:12:54] Shilad: alternatively, you can just use lvm to create your partitions as you see fit [02:13:03] Shilad: you can run puppet with ‘sudo puppet agent -tv' [02:13:43] Do you know anything about it? Is it a reasonably-close-to-local SSD? Or is it NFS mounted? [02:14:09] Shilad: it’s basically a local spinning disk [02:14:23] Shilad: NFS is mounted in /data/project (and your home directory is NFS). use appropriately :) [02:14:38] YuviPanda: Great. Thanks! [02:14:48] Shilad: :) [02:26:51] !log tools created tools-exec-12{01-10} [02:27:20] Logged the message, Master [02:29:32] YuviPanda: what is the storage specification on m1 type instances? ("0 GB storage") [02:29:37] (03CR) 10Krinkle: [C: 031] Move Math repo from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206041 (owner: 10Jforrester) [02:30:41] Negative24: what do you mean by ‘storage specification’? [02:31:06] Negative24: oh, that ‘0GB storage’. ignore that... [02:31:13] NFS? 
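(A rough sketch of what Shilad's next steps on the instance could look like, following the recipe above: tick role::labs::lvm::srv in the wikitech configure panel, run puppet, and check the mounts. Only `sudo puppet agent -tv` and the /srv mount point come from the conversation itself; the device and volume-group names in the manual-LVM branch are made up for illustration.)

```bash
# Apply the newly ticked role right away instead of waiting ~20 minutes:
sudo puppet agent -tv

# Afterwards the leftover instance storage should show up as /srv,
# and NFS vs. local disk can be told apart like this:
df -h /srv                 # local spinning disk, per the discussion above
findmnt -t nfs,nfs4        # should list /home and /data/project (NFS)

# The do-it-yourself alternative YuviPanda mentions -- plain LVM.
# Device and names below are hypothetical:
sudo pvcreate /dev/vda4
sudo vgcreate instance-vg /dev/vda4
sudo lvcreate -l 100%FREE -n srv instance-vg
sudo mkfs.ext4 /dev/instance-vg/srv
sudo mount /dev/instance-vg/srv /srv
```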
[02:31:14] Negative24: look at root partition size, that’s how much you’re getting for realz [02:31:24] Negative24: so your root partition is always local spinning disks [02:31:31] and then your /data/project and /home is NFS [02:31:38] figured as much [02:31:57] just wondering if storage is "more" NFS [02:32:58] nope [02:33:07] I guess it used to refer to Ceph [02:33:07] but we no longer use ceph [02:33:19] ah [02:35:21] !log tools set tools-exec-12{01-10} to configure as exec nodes [02:35:26] Logged the message, Master [02:36:21] gets me every time: phabricator's git auth uses a different password. the "VCS password" [02:37:51] * Negative24 wonders why Phab has to make things complicated. "For security reasons..." they say... [02:38:44] but at least phab-02 works [02:54:55] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:00:52] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:01:52] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:13:40] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:16:17] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:18:59] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [04:03:02] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/TheMesquito was created, changed by TheMesquito link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fTheMesquito edit summary: Created page with "{{Tools Access Request |Justification=Bot for general wikipeida-en stuff to join my channel on freenode |Completed=false |User Name=TheMesquito }}" [04:31:41] Welcome [04:31:57] Hello! [05:49:49] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [06:44:36] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [06:47:18] Anyone around to poke wikibugs ? [07:09:36] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [07:49:50] RECOVERY - Puppet failure on tools-mailrelay-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:38:14] PROBLEM - Puppet staleness on tools-shadow is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [09:11:01] andrewbogott_afk, YuviPanda: is something wrong with the new hardware? I just spawned debian instance "huggle-d2" and it's REALLY very slow. Just launching aptitude took almost 10 minutes [09:11:29] and it's obviously not IO but CPU which is slowing it down [09:11:41] it has 1 core only, but still, it shouldn't be so slow [09:11:58] what CPUs do these boxes have, e5? [09:12:32] oh, it's e3...
same thing I use in my own servers :o [10:20:31] PROBLEM - Puppet staleness on tools-exec-catscan is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:21:45] PROBLEM - Puppet staleness on tools-exec-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:24:15] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0] [10:26:13] PROBLEM - Puppet staleness on tools-exec-11 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:26:27] PROBLEM - Puppet staleness on tools-exec-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:26:43] PROBLEM - Puppet staleness on tools-exec-04 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [43200.0] [10:27:24] PROBLEM - Puppet staleness on tools-exec-09 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:28:00] PROBLEM - Puppet staleness on tools-exec-24 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:28:30] PROBLEM - Puppet staleness on tools-exec-22 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:29:16] PROBLEM - Puppet staleness on tools-exec-06 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:29:24] PROBLEM - Puppet staleness on tools-exec-23 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:31:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:32:16] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:33:12] PROBLEM - Puppet staleness on tools-exec-wmt is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:33:34] PROBLEM - Puppet staleness on tools-exec-12 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0] [10:34:34] PROBLEM - Puppet staleness on tools-exec-21 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:34:35] PROBLEM - Puppet staleness on tools-exec-13 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:36:07] PROBLEM - Puppet staleness on tools-exec-08 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0] [10:36:50] PROBLEM - Puppet staleness on tools-exec-14 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:37:04] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:37:53] PROBLEM - Puppet staleness on tools-exec-10 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0] [10:40:25] PROBLEM - Puppet staleness on tools-exec-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [11:36:50] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [11:47:00] Coren_away, YuviPanda, andrewbogott_afk: someone here? The CPU problems on huggle-d2 still happen [11:47:15] it's insanely slow, execution of few instructions take minutes [12:22:37] petan: can you check which virt node it runs on ? 
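(The lookup petan is being asked to do normally goes through the instance's wikitech page, since the @labs-info bot no longer answers. An admin with OpenStack credentials could get the same answer directly from the nova CLI; a sketch, assuming admin access:)

```bash
# Show which virt host a given instance is scheduled on (admin-only field):
nova show huggle-d2 | grep 'OS-EXT-SRV-ATTR:host'
```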
[12:22:48] hmhmhm [12:22:50] I have an instance on beta that has slow downs, it is on virt1007 [12:22:55] @labs-info huggle-d2 [12:22:55] I don't know this instance, sorry, try browsing the list by hand, but I can guarantee there is no such instance matching this name, host or Nova ID unless it was created less than 38 seconds ago [12:23:03] no longer works :/ [12:23:12] hashar: where can I find it [12:23:14] even had the kernel report "BUG: soft lockup - CPU#1 stuck for 23s!" [12:23:20] petan: wikitech! [12:23:29] just search for your instance name [12:23:41] [ 5153.930319] hrtimer: interrupt took 11107333 ns [12:23:45] I have this [12:23:47] the search result should have a page to something like https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000xxxx.eqiad.wmflabs [12:23:51] looks to me like kernel issue as well [12:24:22] labvirt1005 [12:24:47] hey [12:25:08] the cpu is really totally exhausted, cpuinfo says it's "2.4GHZ" but even running a few instructions there takes forever [12:25:14] filing a bug [12:25:38] PROBLEM - Puppet staleness on tools-exec-20 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [12:26:48] petan: can you check dmesg ? [12:26:52] dmesg -T [12:30:21] petan: I have filed https://phabricator.wikimedia.org/T97520 - you are on CC - [12:30:31] thanks [12:30:42] petan: please look at dmesg -T [12:30:43] might help [12:31:03] yes I did, last message is [12:31:05] [Wed Apr 29 10:05:43 2015] hrtimer: interrupt took 11107333 ns [12:31:11] nothing else there [12:31:24] everything before this looks like normal boot messages [12:31:50] well [12:31:51] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain package detection failed [12:31:52] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain core detection failed [12:31:53] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain uncore detection failed [12:31:54] this too [12:32:01] 1005 is suspected of having issues; that is why Andrew has been avoiding it in migrations. Apparently, the suspicions were warranted. [12:32:39] well, it's not like it wouldn't work, it's just extremely slow, it would be cool if someone could ssh to bare metal and check the current CPU utilization of KVM [12:32:48] maybe there are just too many instances running on it? [12:33:00] does it have VT-x and VT-d enabled on CPU? [12:34:11] petan: Do try to shut down and restart the instance - we think some of the migrations towards 1005 left the vms in an odd state. [12:34:16] if the CPU utilization is indeed huge (too many busy instances), then it makes sense for it to run slow [12:34:21] ok [12:34:58] I issued "reboot" let's see if it's gonna do something :P it just locked up for now [12:35:09] Coren_away: would virt1007 be affected as well? [12:35:15] will try to reboot it [12:35:35] the task is https://phabricator.wikimedia.org/T97520 :] [12:36:08] hashar: We noticed nothing of the sort, but I suppose it's possible live migration just seemed to work right. :-( I'll notify andrew as soon as he wakes. [12:36:22] I am going to restart the instance [12:36:26] might fix it [12:38:38] I am just wondering how many instances are on virt5 now? [12:38:50] 1005 [12:39:24] petan: Few. That's not the issue. [12:40:52] hashar: is your instance ubuntu or debian? [12:41:04] ubuntu-14.04-trusty (deprecated 2015-02-03) [12:42:12] ok it's back up [12:43:15] cpu is still kind of fucked up [12:43:35] Stepping: 1 :P no wonder [12:43:46] it's the first version intel made, now they are fixing the bugs... [12:44:09] * Coren_away chuckles.
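(The triage hashar walks petan through above, condensed into one command; the grep patterns are simply the symptoms quoted in the conversation, not a canonical list.)

```bash
# Human-readable timestamps, filtered for the kernel messages seen above:
dmesg -T | grep -iE 'soft lockup|hrtimer|intel_rapl|EDAC'
```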
[12:44:19] They be virtual cpus. They're all model 42 stepping 1. :-) [12:44:58] Aha. Looks like a hardware issue on virt1005 [12:45:00] idk, lscpu usually shows the real cpu even on virtual machines, at least for all my machines where I run some virtual machines [12:45:06] [422616.696930] EDAC MC0: 13926 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#2 (channel:1 slot:2 page:0x1c9faa4 offset:0xfc0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 channel_mask:2 rank:8) [12:45:31] hmm... either memory or controller :/ [12:45:32] Actual cpu is model 62 stepping 4. [12:45:39] aha [12:49:33] We'll have to evacuate and shut that one down asap. [12:50:45] hashar: virt1007 is old hardware and old OS; but you can't currently trust wikitech for "where are instances" because while migration is going on that is only irregularly updated. [12:51:07] ohhhh [12:52:03] | b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05 | labvirt1005 | ACTIVE | [12:52:34] To no-one's actual surprise. [12:53:12] As soon as Andrew wakes we'll make a plan to evacuate the failing hardware asap [12:53:32] Said plan may include yelling at the manufacturer a lot. [12:53:42] This is brand new hardware. [12:59:13] ahh [13:00:32] maybe that is just that instance having issues? [13:01:29] ah I have just seen your subtask [13:02:09] andrewbogott_afk: Ping! Wakeupworthy I believe. [13:02:27] do you really think this wakes him up? [13:02:36] o.o [13:02:51] petan: never underestimate our ops! [13:03:15] well I mean, pinging someone on irc? :D he had to be sleeping with laptop under pillow [13:03:37] not that I wouldn't :P [13:12:01] ok... [13:12:16] The backscroll is full of stuff from yuvi rebuilding things. Coren, can you summarize? [13:12:40] andrewbogott: User reports of instances being crappy led me to [13:12:41] * hashar prepares coffee and donuts [13:12:54] https://phabricator.wikimedia.org/T97521 [13:13:08] Once I realized they were all on the same host [13:13:41] some instances started having very slow/stuck CPUs; petan and I have two examples that each run on labvirt1005. [13:14:29] So, that means that a single dimm is unreadable? [13:15:17] That is — do you judge that to be a hardware issue limited to a particular chunk of memory, or is it something systematic like the bus introducing errors? [13:15:34] andrewbogott: It's not immediately clear whether that's the dimm or the south bridge having issues; lemme grep the logs to see. [13:16:38] And, one of the symptoms is cpu spiking? (Because that may be happening on 1003 as well…) [13:16:41] * andrewbogott checks logs on 1003 [13:17:11] andrewbogott: I see only errors for Channel 1, but more than one dimm reported. [13:17:50] andrewbogott: So "unclear". One dimm can cause issues on the channel - that wouldn't be the first time I see this - or there is an electrical issue with the bank. [13:17:57] Or the south bridge is busted. [13:18:03] for what it is worth, I restarted deployment-parsoid05 from the wikitech console and it does not come back. Error: Failed to terminate process 5990 with SIGKILL: Device or resource busy [13:18:04] I can't really say that CPU is spiking, it's just very slow in processing stuff, eg. instance is running VERY slow [13:18:04] I think that's "service call" time [13:18:38] my instance rebooted fine but is still very slow, try yourself it's huggle-d2 [13:18:53] just starting aptitude takes 2 minutes for it to load [13:19:01] Coren: I’m no good at reading dmesg, are there timestamps?
Can you tell how long this has been happening? [13:20:18] If we reboot will it detect the failure during POST and sequester the bad memory? Or will it detect the failure and refuse to boot? [13:20:20] andrewbogott: syslog is full of it going back to Apr 23 04:40:45 [13:20:50] btw Coren if wikipedia is not mistaken, the CPU's host bridge to handle RAM is called the "northbridge"; the southbridge is for peripherals, PCI cards and so on [13:20:52] andrewbogott: I don't know - depends on how the BIOS works and how thorough its check is. [13:21:11] petan: Erm, yes. My compass is backwards this morning. :-) [13:21:48] petan: ok if I stop your instance? I want to see if 1005 is healthy enough to support an evacuation [13:22:03] it's even ok to nuke it, I just created it, totally empty [13:22:15] great, a perfect test subject [13:28:25] petan: ok, how does it look? [13:34:56] looks damn fast [13:35:37] great. [13:35:43] OK… next up, hashar, did you have a troubled instance? [13:36:19] yup deployment-parsoid05 [13:36:31] I tried to restart it via wikitech but it does not come back [13:36:56] hashar: If I accidentally murder it, would that be ok? [13:36:59] it is error status and horizon reports Failed to terminate process 5990 with SIGKILL: Device or resource busy [13:37:08] it runs the parsoid service [13:37:12] not sure how long it takes to rebuild it [13:45:00] Coren: I’m thinking that I’ll shrink these instances as part of the migration script. But I’m /also/ thinking that I’ll try suspending rather than stopping them… [13:45:07] Do you think that shrinking a suspended instance is bad news? [13:46:19] brb [13:46:28] andrewbogott: It *probably* isn't - by definition, the zeroed out blocks will remain zeroed blocks and - afaik - suspended instances do not keep the image file open [13:46:37] andrewbogott: But I'd test on a non-precious instance first. [13:46:41] * andrewbogott tries it [13:47:59] Second thing I'd worry about is the possibility of corruption of instances. It's not very likely, but it's possible. [13:50:07] Test seems to’ve worked fine [13:56:36] Goes to grab caffeine and food [14:17:01] oh for god's sake [14:17:16] integration-saltmaster | labvirt1005 [14:17:16] :( [14:17:47] You know, I thought virt1005 was already depooled because you had already considered it suspect. [14:17:58] * Coren must be remembering wrong. [14:20:54] RECOVERY - Host deployment-parsoid05 is UP [14:21:02] there is some hope! [14:27:49] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/TheMesquito was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=156413 edit summary: [14:33:25] Coren: I removed it from the migration script but didn’t depool it from the scheduler. [14:34:18] Ah. [14:34:53] although obviously that would’ve been smarter [14:37:35] Coren: can we drain tools-exec-04 and then just delete it? [14:40:34] * Coren checks room on the queues. [14:40:55] Yeah, should be okay - especially as Yuvi is making a new batch. [14:41:21] !log tools disabled -exec-04 (going away) [14:41:29] Logged the message, Master [14:42:56] It's being difficult. Jobs are slow to die. [14:43:07] There we go. [14:44:13] andrewbogott: can you confirm you are migrating instances out of virt1005 ? [14:44:18] !log tools -exec-04 drained; removed from queues. Rest well, old friend. [14:44:23] Logged the message, Master [14:44:25] deployment-parsoid05 went back up and is working just fine now as far as I can tell [14:44:34] andrewbogott: -exec-04 is now okay for reaping.
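(The drain Coren just performed on tools-exec-04, sketched as the standard gridengine commands; the exact queue layout and host FQDN on tools are assumptions here, not taken from the log.)

```bash
# Disable every queue instance on the node so no new jobs land there:
qmod -d '*@tools-exec-04'
# Wait until nothing is left running on it:
qstat -f | grep tools-exec-04
# Then drop it from the grid before deleting the VM:
qconf -de tools-exec-04.eqiad.wmflabs
```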
[14:45:30] andrewbogott: Dunno if it's because of all the added activity from the evacuation, but the rate of RAM errors seems to be increasing. [14:47:14] !log tools deleting tools-exec-04 [14:47:18] Logged the message, dummy [14:48:55] Coren: so, I’m using two different migration scripts, one which suspends/copies and one which stops/copies/shrinks [14:49:13] Because I’m too nervous to do suspend/copy/shrink [14:49:28] So, the other exec nodes will be suspended and copied, when I get to them. They don’t need shrinking anyway since they’re new [14:49:43] wmf [14:49:52] Of course the copying takes ages [14:50:09] Sudden, obvious application for 10g ethernet [14:50:18] *wfm [14:50:43] WikiFedia Moundation [14:50:47] PROBLEM - Host tools-exec-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.33) [14:51:35] and wikitech updates its host/virt information just fine apparently https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:I-000006d8.eqiad.wmflabs&diff=156398&oldid=156380 [14:51:39] ^O^ [14:52:08] hashar: It updates on an instance state change. [14:52:15] So when I do live migration, that’s invisible and it doesn’t update [14:52:25] but a cold-migration that involves a restart, that should cause an update. [14:52:50] I need to hack in a big red button that forces an update everywhere [14:53:30] I’ve already implemented daily polling updates but the subsystem that it requires seems to be broken in icehouse :( [14:55:08] hashar: I just moved deployment-test. I’m going to do the other deployment instances now if that’s OK [14:55:20] andrewbogott: yeah sure go! :] [14:55:30] ok, next up is deployment-cache-bits01 [14:55:39] thank you for taking care of all that mess on wake up [14:56:00] meanwhile, I am wondering whether you guys run a memory check when receiving new hardware ( memtest86 ? ) [14:56:41] I don’t. These boxes take minutes to start up, I figured that was because they were running a memory check on each startup [14:56:43] I don't think that's part of SOP but you'd want to ask our DC meisters. [14:57:34] I am not sure how much it is worth it [14:57:48] but potentially a live cd that boots into memtest86 would be a good check once a machine is rackd [14:57:49] racked [14:57:57] (live usb key) [15:54:22] Coren: does labs zero rev_text_id ? [15:55:00] I believe it does, iirc [15:56:06] grr, then that's not the root cause [15:56:29] root cause of? [15:56:50] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:57:20] Coren: https://phabricator.wikimedia.org/T97532 [15:59:27] fwiw it's zeroed out in labs because it's worthless (no access to the text store) and can reveal that a suppressed edit has taken place in some cases where it shouldn't. [15:59:53] But if you need queries run in prod to validate something in re this bug you can bug one of us. [16:01:00] Coren: wasn't aware of the zeroing and thought that might be the cause of the bug. but it's a false positive [16:01:00] Betacommand: I posted the root cause on the other bug [16:03:56] legoktm: which other bug? [16:04:09] Betacommand: the tracking one, https://phabricator.wikimedia.org/T97536 [16:05:24] legoktm: using the oldid= url hack they should still be visible though, shouldn't they? [16:06:20] valhallasw`cloud: think wikibugs died [16:06:21] https://en.wikipedia.org/?oldid=659870568 apparently not [16:06:59] legoktm: correct, which is what has me scratching my head [16:41:26] Betacommand: let me give it a nudge...
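(What the stop/copy/shrink variant of the migration script plausibly boils down to on the virt host. This is a sketch, not the actual script: the domain name reuses the instance ID from the wikitech URL above and the image paths are the stock nova layout, both of which are assumptions.)

```bash
# Pause the guest so the image stops changing underneath us:
virsh suspend i-000006d8
# Rewriting the qcow2 image drops blocks that are all zeroes, which is
# why zeroed-out space "stays zeroed" after the shrink, per Coren above:
qemu-img convert -O qcow2 \
    /var/lib/nova/instances/i-000006d8/disk \
    /var/lib/nova/instances/i-000006d8/disk.shrunk
mv /var/lib/nova/instances/i-000006d8/disk.shrunk \
   /var/lib/nova/instances/i-000006d8/disk
virsh resume i-000006d8
```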
[16:43:41] !log tools.wikibugs valhallasw: Deployed 1d785dc3ad22a434749f8ec0d466180f3de9ea52 channels: Continuous-Integration is now Continuous-Integration-Infrastructure wb2-phab, wb2-irc [16:43:46] Logged the message, Master [16:44:43] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1245986 (10valhallasw) dum [16:44:46] Betacommand: ^ [16:44:53] thanks for mentioning it [16:45:24] valhallasw`cloud: np [16:45:37] thanks for fixing [16:57:49] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:19:01] I have a question about i18n, what's the best approach nowadays? Set up separate databases per language, or use an Extension, or both? [17:19:14] bd808? ^ [17:27:19] renoirb: i18n for the wiki content? [17:27:28] bd808?, yes [17:27:36] As in content translation [17:28:58] Hmmm... I know we generally use separate wikis but there are places like meta and commons that have mixed languages. You might ask the nice folks in #mediawiki-i18n for best practices. [17:57:25] YuviPanda: let me know when you’re up and working? [18:03:16] OMG I found a solution to the LDAP thing. It is *so* ugly. :-) [18:05:22] Turns out the solution to keeping the cake and eating it too is to have a fake cake. :-) [18:06:57] what was the problem? [18:08:19] Platonides: Not directly related to labs; it's tech debt around an evil hack on the labstores that prevents us from using the same user settings as other places in prod. The short of it: NFS needs to have the LDAP groups to work right (because of the 8-group limit) but that conflicts with our normal puppet admin user distribution. [18:08:32] Which is bad. [18:10:28] recompiling the kernel with a bigger group limit? [18:10:35] when something like the RAM hardware failure happens like on virt1005, i guess that doesn't really have an actionable, right? it's more like "shit happens" [18:10:41] That's not a kernel issue - that's a protocol issue. [18:11:02] mutante: "Yell at the vendor a lot" is an actionable. This is new hardware. :-) [18:11:05] the only thing i could think of would be "switch hardware vendor" but [18:11:13] that seems like "could happen with anyone" [18:11:22] Coren: heh, yea [18:11:40] well, i guess it all depends if it happens again [18:13:03] "[NFS] AUTH_SYS started off with 8, then went to 12, and finally settled on 16 supplemental group identifiers" [18:13:56] there's a fixed limit even in rfc1831 :( [18:15:46] I know. The solution is to let the server handle perms. That works fine, so long as the server actually knows what the group memberships are. [18:16:36] Turns out I was able to convince nslcd to provide *numeric* usernames for ldap entries so that they do not conflict with /etc/passwd entries. [18:16:41] Fake cake. :-) [18:17:28] And since the usernames are, in fact, the user IDs, that works even if the tool is trying to be smart and recognize when you're giving a uid and not a username. [18:17:49] (Because both are the same) [18:24:48] hehe [18:31:27] 10Tool-Labs: Get rid of jlocal - https://phabricator.wikimedia.org/T95796#1246376 (10Ricordisamoa) [19:04:34] andrewbogott: am up now [19:04:36] * YuviPanda reads backlog [19:06:52] YuviPanda: I have a script moving most instances off of labvirt1005. I haven’t touched tools instances, they are: https://dpaste.de/NOM8 [19:07:07] Question is — those exec nodes aren’t pooled, are they? So I can just delete ‘em?
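(The protocol limit Coren and Platonides are discussing above is easy to see from a client: AUTH_SYS carries at most 16 supplementary group IDs per request, so membership past that is silently dropped on NFS unless the server resolves groups itself. A quick check:)

```bash
# Count this user's supplementary groups; anything past 16 never
# makes it into an AUTH_SYS NFS request:
id -G | wc -w
```

Having the server do the lookup (for example rpc.mountd's --manage-gids option) is the usual workaround; the log does not say which exact mechanism the labstores ended up with beyond the nslcd trick described above.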
[19:07:14] andrewbogott: you can, yes [19:07:30] andrewbogott: let me switch over static-02 [19:07:44] @seen hashar [19:07:44] mutante: Last time I saw hashar they were quitting the network with reason: Remote host closed the connection N/A at 4/29/2015 3:48:44 PM (3h18m59s ago) [19:07:48] andrewbogott: mailrelay isn’t being used atm but valhallasw`cloud did some work on it yesterday - so would be nice to keep it alive [19:07:52] YuviPanda: shall I just delete the exec nodes now? [19:08:01] YuviPanda: ok, great, I will cold migrate it. [19:08:10] andrewbogott: yeah you can. [19:08:21] eh, feel free to kill mailrelay; it's in a broken state and I haven't done anything other than the initial puppet run [19:08:29] valhallasw`cloud: sure? [19:08:33] yes [19:08:40] ok, this is easy! [19:08:48] * andrewbogott deletes everything [19:09:05] I'm thinking of actually rebuilding it on precise for now, so we can depool -mail earlier [19:09:34] but YuviPanda probably disagrees with that ;-) [19:11:17] I'm not going to work on it today anyway, today it's LaTeX cursing day instead of puppet cursing day [19:11:50] PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.16.133) [19:11:51] !log tools failed over tools-static to tools-static-01 [19:11:57] Logged the message, Master [19:12:10] PROBLEM - Host tools-exec-1208 is DOWN: CRITICAL - Host Unreachable (10.68.17.119) [19:12:24] andrewbogott: you can cold migrate or kill tools-static-02. it’s trivially recreatable [19:12:34] PROBLEM - Host tools-exec-1401 is DOWN: CRITICAL - Host Unreachable (10.68.16.151) [19:12:38] PROBLEM - Host tools-exec-1203 is DOWN: CRITICAL - Host Unreachable (10.68.17.83) [19:12:44] PROBLEM - Host tools-exec-1202 is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [19:12:47] I’d prefer to kill, since bandwidth for copying is the bottleneck at the moment. [19:12:54] YuviPanda: safe to kill now or shall I wait? [19:12:56] andrewbogott: feel free to kill. [19:13:20] * andrewbogott is nostalgic for the ciscos [19:13:25] heh [19:14:00] PROBLEM - Host tools-mailrelay-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.57) [19:14:15] YuviPanda: ok, everything is deleted. If you schedule new instances they’ll be scheduled somewhere not labvirt1005. [19:14:39] Thank you for rebuilding everything, this is going to save a bunch of time [19:15:25] andrewbogott: :) cool! [19:15:32] andrewbogott: should I start rebuilding now or wait? [19:15:45] I’m not sure. [19:15:53] Wait, unless tools is blocked for want of exec nodes. [19:16:02] YuviPanda: any objections against rebuilding mail with precise for now? that also separates the question 'is everything puppetized' from 'does it work on trusty' [19:16:12] valhallasw`cloud: yeah, I agree. [19:16:24] andrewbogott: none deleted atm were pooled, so it's ok. [19:16:35] andrewbogott: I want to rebuild at least tools-static-02. [19:16:39] sure, go ahead [19:17:07] andrewbogott: but all the current exec nodes have puppet disabled on them because there are destructive-ish changes live. I want to get on it sooner than later - puppet runs on a fresh exec instance take like hours [19:18:10] YuviPanda: There’s no real reason not to start building them now, other than my being paranoid. [19:18:21] The sooner you kill the old exec nodes the sooner that disk space is freed up. [19:18:24] yeah [19:18:43] So go ahead and build ‘em if that's next on your list. [19:19:59] andrewbogott: alright. [19:20:20] andrewbogott: what’re we going to do about the new nodes? scream at vendor?
[19:20:20] err [19:20:22] new node [19:20:39] YuviPanda: scream at vendor, replace dimms, approach with caution. [19:20:45] alright [19:20:57] andrewbogott: we have enough room left even without that, right? [19:21:06] maybe! [19:21:08] Coren: sorry I didn’t see your swap patch and wrote my own :| [19:21:12] haha :) [19:21:39] o_O. I'm pretty sure I told you I was about to write it, and pointed it at you when I was done. :-P [19:22:58] They are suspiciously similar. :-) [19:23:29] Coren: yeah :| I forgot the mount, saw yours and then wrote the mount [19:23:47] Coren: I saw you mention that you were going to write it, and I think you added me as reviewer before you left but I didn't see the gerrit queue until it was too late [19:23:49] sorry about that [19:24:00] Things happen. :-) [19:28:46] !log tools recreated tools-static-02 [19:28:51] Logged the message, Master [19:30:07] !log tools set appropriate classes for recreated tools-exec-12* nodes [19:30:11] Logged the message, Master [19:32:55] Coren: re: the labs NFS puppet change, paravoid / bblack (I think those were the ones who -1’d the bash script?) should take a look maybe? I still do not feel comfortable enough with NFS / lvm snapshots to +1 it [19:33:17] !log re-created tools-mailrelay-01 with precise: [[Nova_Instance:i-00000bca.eqiad.wmflab]] [19:33:18] Coren: same for the ldap change too [19:33:18] !log tools re-created tools-mailrelay-01 with precise: [[Nova_Instance:i-00000bca.eqiad.wmflab]] [19:33:32] re-created is not a valid project. [19:33:36] Logged the message, Master [19:33:43] and I forgot to copy an s. Oh well. [19:34:11] andrewbogott: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Virtualization%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false has virt1001 spiking (just a fyi) [19:34:11] it also doesn't exist. What did I break this time >_< [19:34:19] valhallasw`cloud: what doesn’t exist... [19:34:36] oh, it's nova resource [19:34:37] that's why [19:34:40] derp. [19:34:59] ah heh :) [19:35:39] YuviPanda: maybe because something is being copied over? We’ll see if it settles. [19:35:44] andrewbogott: ok [19:45:02] YuviPanda: and this puppet run was done in a few minutes. What the heck? [19:45:19] okay, well, we have a new mail host then, I guess [19:45:40] valhallasw`cloud: presumably... [19:45:49] valhallasw`cloud: how to test? [19:45:54] ...or the puppet master is just slow [19:46:19] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:49:02] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:51:58] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:52:45] YuviPanda: hum, there seems to be something wrong with the gridengine manifest. gridengine-common_6.2u5-4_all.deb tries to unpack in /var/lib/gridengine, but that's our shared NFS mount so that breaks [19:53:11] hurr durr. I wonder if that’s a first run issue? [19:53:12] Coren: ^ [19:53:33] seems to be consistent on further puppet -tv runs, even after a manual apt-get remove gridengine-common [19:53:51] * Coren thinks. [19:53:54] Ooooo. [19:53:57] idmap. [19:54:14] it's trying to chown to an inconsistent uid. [19:54:25] probably, yeah [19:54:30] error setting ownership of `/var/lib/gridengine/utilbin.dpkg-new': Invalid argument [19:54:39] sorry, I should have mentioned that error immediately [19:54:54] * Coren ponders. [19:55:58] Heh.
Spent so much time tracking down possible issues in other projects I never considered gridengine. [19:57:30] wait, sgeadmin has an ldap uid expressly for that reason. [19:57:41] what is it trying to chown /to/? [19:58:01] I'm not sure how to check; it's apt-get. [19:58:08] Lemme try by hand. What's the node? [19:58:09] it's in gridengine-common_6.2u5-4_all.deb [19:58:15] tools-mailrelay-01 [19:59:04] valhallasw`cloud: Coren it’s affecting the new nodes too [19:59:31] YuviPanda: I'd expect anything that tries a first-time install of gridengine-common [19:59:38] * Coren tries to see what apt is trying to do. [20:00:10] yeah [20:00:10] these single cpu hosts are horrible :/ [20:00:31] valhallasw@tools-mailrelay-01:~$ dpkg -c /var/cache/apt/archives/gridengine-common_6.2u5-4_all.deb | grep utilbi [20:00:32] drwxr-xr-x root/root 0 2012-01-24 20:46 ./var/lib/gridengine/utilbin/ [20:00:38] so, root, I guess? [20:01:18] unless it's some manual post install step [20:02:52] chown("/var/lib/gridengine/utilbin.dpkg-new", 0, 0) [20:02:59] That's not supposed to fail. [20:03:33] oh, you went for the strace option. Smart. [20:03:49] Hm. That's oddd. [20:04:35] What release is this? [20:05:06] precise [20:05:40] It looks like it's actually trying to do idmap and failing. [20:05:51] Which it shouldn't - I see the module option is there. [20:07:29] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:07:53] ... but it's not really. [20:07:57] dafu? [20:08:32] Oh, wait, stupid ordering in puppet? Mind if I reboot the instance? [20:08:58] Coren: go ahead [20:09:33] aha. [20:10:03] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1246872 (10valhallasw) 3NEW [20:10:08] That was it. Stoopid puppet actually mounts the NFS filesystems before the disable has a chance of being put in place. [20:10:24] Reboots will fix - I will make a puppet patch. [20:10:51] * valhallasw`cloud loves how no-one understands the order in which puppet does things [20:11:01] valhallasw`cloud: I understand it perfectly. [20:11:29] "Completely random, with a dose of Murphy (if an order can screw you, it will be done in that order)" [20:11:58] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:12:05] maybe I should rephrase that to 'no-one really gets what the effective dependencies in a real-life system are'? :P [20:14:00] RECOVERY - Puppet failure on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:33] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1246893 (10valhallasw) Need to think of an actual test plan. This should include the following mail **sources**: - receiving mail from an external host -- SMTP in and check delivery - delivering local mail - deliver... [20:15:56] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:16:53] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:17:49] on tools-bastion-02, i get timeouts when connecting to http://tools-static.wmflabs.org/ - is that intentional/permanent? [20:18:17] afeder: ah, that’s something that needs to be fixed. let me do that now.
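(How the failing call above was likely pinned down: run the package installation under strace and watch the ownership syscalls. The dpkg invocation is an assumption; the .deb path and the failing chown line are straight from the log.)

```bash
# Reinstall the package under strace, tracing only ownership changes:
strace -f -e trace=chown,fchown,lchown \
    dpkg -i /var/cache/apt/archives/gridengine-common_6.2u5-4_all.deb 2>&1 |
    grep /var/lib/gridengine
# On the broken instances this shows chown(..., 0, 0) on
# /var/lib/gridengine/utilbin.dpkg-new failing with EINVAL over NFS.
```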
[20:21:19] RECOVERY - Puppet failure on tools-mailrelay-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:25:52] woo, someone changed something and the instance i'm running tests against got like 10x faster :) [20:31:00] valhallasw`cloud: Applied. This'll only apply to new instances though - those you created in between need a reboot. [20:31:17] booo [20:31:45] Coren: do you think you can take over puppet-running and pooling tools-exec-12**? there are 10 instances that are running puppet but need a reboot now because of this, I guess. [20:31:51] Coren: thanks [20:32:02] I want to get some availability metrics stuff done before end of week.. [20:32:04] Hm... I thought that it was just "copy-paste + search-replace" to support new wikis.... https://phabricator.wikimedia.org/T76939 :-D [20:32:25] Coren: if you’re busy with the NFS stuff (or anything else) I can do this no problems tho [20:33:06] YuviPanda: Can do; it's a one-liner. :-) [20:33:32] Coren: cool :) puppet is failing there because of the nfs issue tho [20:33:34] YuviPanda, Coren, could either of you glance over the test plan @ https://phabricator.wikimedia.org/T97574#1246893 ? [20:33:38] so need to verify they have working puppet as well [20:34:04] valhallasw`cloud: lgtm. you probably need one of us to set the MX domain record tho [20:34:06] for testing [20:34:09] -02 [20:34:26] *nod* [20:34:41] valhallasw`cloud: also, I think making it work on trusty and precise at same time might be too hard - if so, you can just test on toolsbeta and then switch over tools when you’re convinced it works [20:35:03] YuviPanda: I think we have to because we're not the only exim in town [20:35:13] YuviPanda: All 10 rebooted [20:35:59] valhallasw`cloud: have to for my suggestion or ‘both precise and trusty’? [20:36:25] YuviPanda: I think we need to keep precise working because there's other people who use the exim class [20:37:08] and it shouldn't be hard, really, just a few ifs [20:37:16] but it's puppet so probably hard, yeah. :-p [20:37:45] oh, and I need to figure out how to get access to puppet-compiler. Hashar, probably? [20:38:15] valhallasw`cloud: we shouldn’t mess around with the exim class, no? Just the tools relay one... [20:38:42] YuviPanda: I think we should first merge scfc's patch, then work on trusty [20:43:20] YuviPanda: when we have puppet-compiler, we can also check no changes will be deployed on the precise host! \o/ [20:43:41] valhallasw`cloud: I think making tools-beta better might be much easier and also more worthy. [20:43:54] YuviPanda: not really. [20:43:59] valhallasw`cloud: puppet compiler is going to give you nothing but pain and a million shaving yaks. [20:44:07] YuviPanda: and why exactly? [20:44:18] it tells you exactly how the catalog for a specific host changed [20:44:23] it’s closer to static analysis than actually checking. [20:44:57] valhallasw`cloud: it fails non-deterministically a lot of times. ori was fighting with it yesterday about how it sometimes doesn’t know what OS the target host is... [20:45:23] anyway, it’s just personal experience - I don’t have enough experience with it to tell you why it sucks, but general chatter in -operations and other channels is that ‘eugh, really?' [20:45:56] YuviPanda: yeah, but the alternative is deploying on toolsbeta, and that /really/ is 'eugh'.
[20:46:14] I think if you use puppetcompiler enough you’ll change your opinion :) [20:46:20] and then you have to parse puppet agent -tv output to sort-of guess that it's doing the right thing [20:46:37] well, only because toolsbeta isn’t exactly the same as tools [20:46:53] ? [20:46:55] anyway, in an ideal world, puppet compiler would be super useful and great, no argument there. [20:47:04] just saying we’re far from there. And good luck :) [20:47:18] yeah, and in an ideal world deploying a change to toolsbeta would not take multiple minutes ;-) [20:48:54] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247026 (10d33tah) Hello, It's been 10 days since I asked for that. Could I please ask for an update on the review process? [20:49:52] oh, this is an issue. Notice: /Stage[main]/Toollabs::Mailrelay/File[/data/project/.system/store/mail-relay]/content: [20:50:05] so the two relays are going to fight over who's the relay [20:50:19] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247028 (10yuvipanda) Heya! sorry this got lost a bit - lots of other firefighting going on. So the biggest instance you will be able to get on labs has 160GB instance storage, so I'm not sure if it'll be big enough for you. Also, why do y... [20:50:28] valhallasw`cloud: oh ouch :| [20:50:31] thcipriani: is deployment-pdf01 ailing? I can’t ssh but maybe that’s normal due to a security rule or something [20:50:49] * thcipriani looks [20:51:21] Coren: cool! (re: reboot). Can you also start the drain / repool? And !log so I / others know what’s happening? Thanks! [20:51:32] YuviPanda: hmm. let's see how we can quickfix this... [20:51:55] valhallasw`cloud: I suggest an ‘active_mailhost’ param or something. we use something similar for active-proxy [20:51:57] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247039 (10d33tah) If I can compress the database, this can probably work. As for rDNS - I can look up the newly-created edits, but for the historical ones I have no option other than correlate them with an existing database. [20:52:01] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247040 (10scfc) (If you find something usable for a "test suite", please report here. In a few weeks, I need to set up my own personal mail server, and I would really like to have some script(s) that I can run to be s... [20:52:06] YuviPanda: that's not really a quickfix :P [20:52:12] alright [20:52:35] super quick would be to disable the mailrelay on tools project and test on toolsbeta :) [20:52:48] * YuviPanda goes for meeting [20:53:15] YuviPanda: oh, just uncheck the puppet class. gotcha. [20:53:40] valhallasw`cloud: yeah. or even ‘sudo puppet agent --disable' [20:54:10] YuviPanda: *nod* done [20:54:27] YuviPanda: I'm almost done for the day, and tomorrow is going to be a stressful one. Can this wait for tomorrow morning? [20:54:28] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247047 (10d33tah) (if I can compress = if I will be able to run a compressed filesystem, either by having root on the VM or asking the admin to help me set it up) [20:54:31] andrewbogott: seems normal, looks like it's been having trouble with puppet runs for a while. Latency seems normal, too. [20:54:33] Coren: oooh, sure. [20:54:41] Coren: sorry, my TZ map for you is still messed up. [20:54:44] thcipriani: great. [20:54:50] Coren: thanks! [20:54:51] thcipriani: mind if I suspend it for a few minutes?
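(The quick fix valhallasw`cloud applied above, spelled out; the `sudo puppet agent --disable` command is quoted from the log, and the reason string is optional but shows up when someone else tries to run the agent.)

```bash
# Stop scheduled puppet runs while the two relays would otherwise fight:
sudo puppet agent --disable 'two mailrelays fighting; testing on toolsbeta first'
# ...and once it is safe again:
sudo puppet agent --enable
sudo puppet agent -tv
```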
[20:54:51] :-) [20:55:06] (translation: If it’s not broken, I’m’a break it) [20:55:10] YuviPanda: FYI, looks like -1201 isn't actually configured as a node. All other 9 seem to be. [20:55:19] andrewbogott: heh, sure, go for it :) [20:55:27] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247048 (10jeremyb) >>! In T96512#1247047, @d33tah wrote: > either by having root on the VM or [...] root is no problem [20:57:17] thcipriani: done. Is it…still working as before? [20:58:30] andrewbogott: yes, everything looks the same. In fact, I didn't lose my ssh session(?) [20:58:41] cool, that’s the idea :) [20:58:44] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247064 (10valhallasw) tools-mail and tools-mailrelay-01 are now fighting over who's the current relay (/data/project/.system/store/mail-relay). I disabled puppet on tools-mailrelay-01 for now. We should probably merge... [20:59:08] thcipriani: so… is it ok if I subject other deployment-prep instances to a similar suspension? Some might be a bit longer, none should be especially long. [20:59:13] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247067 (10d33tah) I'm interested in trying to run it under 160GiB then. [20:59:16] (That pause just saved us 29g of disk space) [21:00:07] wowza, that's a big savings, sure: go for it, that didn't seem particularly disruptive. [21:00:13] Was something changed for tools-login? Can't login anymore ("Algorithm negotiation failed."). Which encryption algorithms are allowed? [21:00:55] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247078 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I've created the project and added you as admin :) Remember that the /data/project and /home mounts on instances are NFS, and should *not* be used for any heavy lifting. [21:00:56] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1247081 (10yuvipanda) [21:02:01] YuviPanda: thanks! [21:02:09] d33tah: yw! [21:02:15] d33tah: sorry it took so long. [21:02:27] sure thing. i decided to ping you guys after 10 days [21:02:37] since i took a quick look at how long it usually takes and it was the max i think [21:03:06] d33tah: yeah [21:03:26] !log deployment-prep suspending and shrinking disks of many instances [21:03:31] Logged the message, dummy [21:03:57] now, i need to figure out how to connect to the vm ;) [21:04:02] YuviPanda: can you restart the bot in 'meetbot'? [21:05:31] apper: hello, i was involved in the sshd change, what's up [21:05:38] tested that on all 4 distro versions we use [21:05:52] interesting - bastion rejected my ssh key [21:06:05] ah [21:06:10] wrong ~/.ssh/config [21:08:27] mutante: I'm using an old version of SSH Secure Shell and I've got this message from tools-login ("Algorithm negotiation failed."). Key exchange failed. I can choose an algorithm (AES, Blowfish, CAST, ...) but don't know which will work [21:10:18] apper: do you set the "Ciphers" option at all in your config? [21:10:19] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247107 (10valhallasw) Actually, I think that fighting was a good server test on itself ;-) A few excerpts from the log file: ``` 2015-04-29 19:33:08 1YnXjI-0000C1-5i "root@tools-mailrelay-01.eqiad.wmflabs" from env-fr...
[21:10:41] apper: so here is the actual change: https://gerrit.wikimedia.org/r/#/c/185325/5/modules/ssh/templates/sshd_config.erb [21:10:49] YuviPanda: i'm a bit confused - you say you granted me a project, but i didn't get any hostname or credentials... is there some article/section i should rtfm? [21:11:03] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247108 (10valhallasw) [21:11:13] apper: see how we already set it differently for older distro versions [21:11:38] apper: can you select AES then? [21:11:41] sorry am in a meeting [21:11:43] brb [21:12:56] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247111 (10d33tah) Thank you! I am a bit lost though - I didn't receive any hostname or credentials. Is there some article/section that I missed and I should have read? I flicked through the Wiki, especially "Help:" pages, but I could have... [21:13:10] sure, sorry [21:13:23] apper: try setting Ciphers aes256-ctr,aes192-ctr,aes128-ctr in client config [21:15:08] mutante: hmm.. i only have "aes256,aes192,aes128" as config options, without ctr [21:15:50] mutante: I think I have to choose a new client.... [21:16:09] mutante: this one seems to be outdated [21:17:07] d33tah: you need to create an instance via wikitech, see https://wikitech.wikimedia.org/wiki/Help:Instances [21:17:30] sitic: thanks, will do [21:19:43] apper: alternatively, try connecting through bastion1.wmflabs.org, which still runs ubuntu 12.04 [21:20:23] i can see 6 instances [21:20:36] oh wait, that has a more limited list of ciphers even [21:21:13] apper: putty, then? [21:21:35] valhallasw`cloud: yes I will try it [21:22:22] valhallasw`cloud: I tried it several times in the last 10 years and hated it every time, but I will try again ;) [21:24:31] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1247126 (10Ricordisamoa) [21:25:07] apper: what is your version btw? [21:25:29] sitic: hm... i can see the six instances, but can't create my own [21:25:58] mutante: SSH secure shell? 3.2.9 [21:25:59] apper: it doesn't work with the aes256 ? [21:26:06] mutante: nope [21:26:11] hrmmm [21:27:38] valhallasw`cloud: apper: i think bastion1.wmflabs.org would be the same issue then though [21:27:58] so if upgrading your client seems an option (putty) that would be easiest [21:28:42] putty works [21:29:10] What I liked about ssh secure shell is that I can use one program for console and file transfer [21:29:18] d33tah: your project currently has no instances: https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikispy (it should look like this https://wikitech.wikimedia.org/wiki/Nova_Resource:Abusefilter-global when you have created an instance) [21:29:26] but I can use putty+something else...
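(mutante's suggestion above, written out as an OpenSSH client config stanza; the host pattern is illustrative. This only helps clients that actually implement the -ctr ciphers, which apper's 2003-era SSH Secure Shell does not, hence the switch to PuTTY below.)

```bash
# Pin the ciphers the servers still accept, for one host only:
cat >> ~/.ssh/config <<'EOF'
Host tools-login.wmflabs.org
    Ciphers aes256-ctr,aes192-ctr,aes128-ctr
EOF
# Verify what actually gets negotiated:
ssh -v tools-login.wmflabs.org 2>&1 | grep -i cipher
```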
[21:30:18] d33tah: I'm only a tools user, but I think you can create instances via configure on https://wikitech.wikimedia.org/wiki/Special:NovaProject (not sure though, as I can't do this) [21:30:49] apper: bitvise ssh can do that as well, but its terminal is meh-meh [21:30:58] apper: yea, sorry about that, but it's probably good if you upgrade since it's about removing old weaker ciphers [21:30:59] apper: I typically use putty + winscp [21:31:50] sitic: nope, doesn't look like [21:32:25] apper: another option is maybe ssh in cygwin [21:32:26] valhallasw`cloud: ^ d33tah can't find the link on wikitech to create an instance for his new project [21:32:30] mutante: yes, I think it's really a good idea to upgrade... 3.2.9 is from 2003... [21:32:47] apper: actually, winscp + putty should work for you. Open winscp, log in to site (gives you a file transfer window) [21:32:48] I'll try putty+filezilla [21:32:56] apper: then click 'open session in putty' to get a shell [21:33:04] apper: ok :) glad you have an alternative, those global changes to all servers are always finding a balance somehow [21:33:25] valhallasw`cloud: ah cool, then I will give winscp a try [21:33:52] is there a chance to have clickable hyperlinks on putty? [21:34:02] d33tah: it's the tiny 'add instance' in https://wikitech.wikimedia.org/wiki/Special:NovaInstance [21:34:05] apper: there was/is "nutty" for that [21:34:13] I have some scripts which just output wikipedia urls... [21:34:17] it's like putty just with clickable URLs [21:34:17] apper: with puttytray or nutty, yes [21:34:24] but unfortunately nutty is outdated too [21:34:29] afaict [21:34:46] https://puttytray.goeswhere.com/ [21:35:11] so my 12 year old ssh secure shell was able to make hyperlinks clickable, but a current version of putty isn't? :/ [21:39:06] starting putty from winscp works well, but I think I have to change my workflow regarding hyperlinks ;) [21:39:16] Thanks, valhallasw`cloud and mutante [21:40:10] apper: yw. you could still try if it works .. http://groehn.net/nutty/ [21:40:17] apper: puttytray puttytray puttytray [21:40:37] what valhallasw`cloud says [21:40:40] "Nutty hasn't been in active development since 2005. However the patch has been included in other PuTTY modifications such as PuTTY Tray. " [21:40:43] same patch :) [21:41:09] so puttytray or nutty? ;) [21:41:43] I'll try puttytray [21:41:45] puttytray [21:42:10] apper: the latest version even lists all links in the context menu :O [21:43:26] https://i.imgur.com/O1rqQFK.png [21:43:48] (they are duplicate because I echo'ed them, but that's hidden behind the context menu :p) [21:43:59] and you can also just click them, obviously [21:46:26] valhallasw`cloud: is it possible to find with ctrl+f? if so, it's not there [21:46:26] thanks [21:46:41] but it seems to be very buggy.... http://www.test.com works, http://www.test.de not... [21:46:42] d33tah: you might need to add your project to the projects list [21:46:47] in the textbox above [21:46:56] for the second one only http://www is clickable... [21:47:18] but for wikipedia links it should work, so thanks! [21:47:22] odd. there's a regex under window>hyperlinks [21:47:33] you can choose 'select nearly any' instead [21:47:41] valhallasw`cloud: ah crap, it's a usability nightmare [21:47:46] d33tah: yes, it is [21:48:06] it's not exactly virtualbox-style virtualization [21:48:19] okay, i can see [21:48:55] anyway [21:48:58] valhallasw`cloud: thanks, that works!
[21:49:07] i can't find the button even though i can see "wikispy" on the list [21:49:52] d33tah: hm. [21:50:00] d33tah: https://wikitech.wikimedia.org/wiki/Help:Instances is the overall guide [21:50:08] d33tah: try logging out and in again [21:50:55] valhallasw`cloud: that helped, thanks! [21:52:59] I'm off to bed; good luck with the instance creation [21:54:23] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247257 (10d33tah) Update: valhallasw`cloud helped me create an instance. [21:56:14] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1245178 (10Andrew) All instances are now migrated off of labvirt1005 -- Chris, you can do whatever you need to fix this box; I'm going to re-image it before putting it back to work. [21:57:52] andrewbogott: the shinken storm seems to have subsided in -releng, pausing completed? [21:58:24] thcipriani: yes, all done. [21:58:35] cool, thanks! [21:59:18] Coren_away: YuviPanda: labvirt1005 is now instance-free. The rest of the cluster is not yet groaning under the load, so I think we’re OK for the moment. [21:59:54] Why do we have so many hardware issues? [22:01:18] Krenair: i wonder if we are above average, actually. maybe it's just a matter of scale, we recently went over 1000 servers [22:01:38] Krenair: or it's only virt* boxes because Cisco (vs. Dell) [22:01:53] needs stats? [22:01:55] Wikimedia Labs | Status: stable. More partial interruptions coming next week | tools-dev has new OS and fingerprint: https://lists.wikimedia.org/pipermail/labs-announce/2015-April/000009.html | https://www.mediawiki.org/wiki/Wikimedia_Labs | Channel logs: https://bit.ly/11GZvbS | Open bugs: http://bit.ly/1l2wFhO | Admin log: http://bit.ly/ROfuY5. [22:02:29] Krenair: We’ve had quite a few issues with the newest line of servers we’ve bought. I don’t know why they’re so error-prone. [22:02:36] Oi, that "open bugs" link is old [22:02:59] Hm, so it is. I’ll just remove it, the topic is a bit wordy anyway [22:03:05] :) [22:03:33] how about https://phabricator.wikimedia.org/tag/labs/ [22:04:04] mutante: is it useful? [22:06:06] andrewbogott: yes, but only if we use workboards in general [22:07:01] i dunno, it just seemed the simplest overview.. nevermind [22:20:17] mutante: Open tickets may be better https://phabricator.wikimedia.org/maniphest/?statuses=open%2Cstalled&allProjects=PHID-PROJ-msyn2z45n7mw45bfuscb#R [22:20:38] From "Open Tasks" in the sidebar on #labs [22:25:37] YuviPanda: I was shrinking instances and I may have damaged wdq-mm-02 — can you check and see if it’s OK? [22:26:52] ah, nevermind, it seems to be there, and fine. [22:33:23] andrewbogott: ah, ok :) [22:33:45] YuviPanda: I shrank everything except for the tools nodes. I’ll wait until you’re done rebuilding and then see what’s left. [22:34:11] It turns out that suspending/shrinking works pretty well. So it might make sense (with the rest of the migration) to live-migrate, suspend, shrink, resume [22:34:15] rather than cold migrate? [22:34:27] I can’t decide if it’s worth going to such extremes to avoid instance reboot. [22:35:20] andrewbogott: I think instances should survive reboot. [22:35:25] yeah [22:35:46] andrewbogott: so cold migrate sounds nicer :D [22:35:54] Probably! [22:36:06] Oh, except cold-migrate also changes private IPs, which is not so nice. [22:36:09] ouch [22:36:10] anyway, I’m out for tonight.
We’ll see if I make it to tomorrow morning without another page [22:36:13] yes that’s not nice, yeah [22:36:16] andrewbogott: :) thank you! [22:49:36] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [22:50:44] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:50:58] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:51:28] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [22:51:43] fine, shinken-wm fine. [22:51:46] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:51:52] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:52:52] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:53:06] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:54:02] PROBLEM - Puppet failure on tools-static-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:54:35] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:55:13] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:55:53] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:58:50] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:16:47] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:19:31] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:19:37] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:45] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:53] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:57] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [23:21:29] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:21:49] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [23:22:52] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:23:06] RECOVERY - Puppet failure on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:23:50] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:24:02] RECOVERY - Puppet failure on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:25:14] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [23:36:46] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1247746 (10Dfko) We're up to about a gigabyte, can I get a dump now? 
[23:43:18] 10Tool-Labs, 3ToolLabs-Goals-Q4: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1247787 (10yuvipanda) [23:43:19] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1247784 (10yuvipanda) 5Open>3Resolved a:3yuvipanda DONE! [23:43:38] 10Tool-Labs, 5Patch-For-Review: webservice start / restart shouldn't always need to specify type of server - https://phabricator.wikimedia.org/T97245#1247788 (10yuvipanda) 5Open>3Resolved a:3yuvipanda