[00:14:56] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Coyewpp was created, changed by Coyewpp link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fCoyewpp edit summary: Created page with "{{Tools Access Request |Justification=For academic research purposes, part of our ongoing UC Berkeley research on Wikipedians. |Completed=false |User Name=Coyewpp }}" [00:17:47] PROBLEM - Host tools-redis is DOWN: CRITICAL - Host Unreachable (10.68.16.18) [00:20:56] andrewbogott_afk: ^ is that you? [00:20:57] RECOVERY - Host tools-redis is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [00:23:08] tools-web seems unresponsive [00:23:08] and my tools-bastion prompt won't return either [00:23:26] Krinkle: hmm, my bastion prompt seems ok [00:23:34] Now it's all responding again. [00:23:34] !log tools rebooted tools-redis [00:23:50] Weird, for 2 minutes everything was stalled from my perspective [00:24:10] andrewbogott_afk: it might be the return of the CPU stalls [00:25:25] [13intuition] 15Krinkle fast-forwarded 06master from 1403f9c39 to 1439b4558: 02https://github.com/Krinkle/intuition/compare/03f9c3919de1...39b45588c621 [00:43:59] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Coyewpp was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155811 edit summary: [00:44:23] Change on 12www.mediawiki.org a page Wikimedia Labs/Things to fix in beta was modified, changed by Greg (WMF) link https://www.mediawiki.org/w/index.php?diff=1628672 edit summary: historical [01:03:37] [13intuition] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/Krinkle/intuition/commit/213cbba886ebc6bcb2b2d53dc8d68326dcc40e01 [01:03:38] 13intuition/06master 14213cbba 15Timo Tijhof: js-env: Implement batching for API requests (100ms debounce)... [01:22:37] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:37] hmm [01:22:37] CPU stalls again on instances, I think [01:26:01] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 758630 bytes in 2.094 second response time [01:43:50] PROBLEM - Host tools-redis-slave is DOWN: CRITICAL - Host Unreachable (10.68.17.150) [01:49:21] RECOVERY - Host tools-redis-slave is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [02:23:12] YuviPanda: hello! [02:23:30] hey andrewbogott [02:23:48] andrewbogott: in backscroll, one freeze is the tools-redis, one is the homepage and one is tools-redis-slave [02:23:53] almost all had freezes on bsations too [02:24:14] Is it generally one at a time, or multiple instances freezing in concert? [02:24:35] andrewbogott: the ones I experienced (first two) were multiple ones [02:24:42] bastion and redis first time, and bastion and proxy second time [02:24:48] well, I noticed bastions only because I was on them. [02:25:53] Hm… it’s really hard to tell the difference between the cpu stall and migration side-effects, since migration saturates the network. [02:26:10] I guess I’ll leave the migration stopped for an hour and we’ll see if anything happens. [02:26:19] If things remain clear then I’ll finish up the tools migration overnight. [02:27:55] andrewbogott: ok. is everything except tools migrated? 
[02:28:02] no [02:28:09] the whole process will take a week or so [02:28:14] longer if I keep stopping in the middle :) [02:28:20] andrewbogott: hmm, ok :) [02:28:31] andrewbogott: just tools instances migrating in the middle of the night gives me jeebies :) [02:28:47] It’s been pretty smooth so far. [02:29:03] It goes smoother when I set things to happen less often, but of course that takes longer. [02:30:14] alright [02:30:48] Yuvi: remaining tools instances: https://dpaste.de/0jjX [02:33:25] andrewbogott: looking [02:33:43] andrewbogott: tools-login should’ve been dead... [02:33:46] andrewbogott: I remember killing it [02:34:00] Oh yeah, it’s shut down. I’ll delete it. [02:34:05] I’m not sure why it lives on zombielike [02:34:07] andrewbogott: cool [02:34:09] heh [02:34:14] it was the first instance ever created [02:34:16] for the tools project [02:34:21] * YuviPanda sheds tear [02:36:10] andrewbogott: anyway thanks :) [02:36:56] Individual instances will continue to freeze as each is migrated. I would hope that migration wouldn’t kill the network to a whole node, but… I guess it might be happening briefly now and then :( [02:38:43] andrewbogott: can I ask you to not migrate webproxy-01 and -02? [02:38:50] andrewbogott: and do that tomorrow when we’re around? [02:38:55] sure. [02:39:08] andrewbogott: we can trivially hotswap them around so there’s no public webservice downtime. plus these might still be somewhat fickle [02:39:16] Or I can do one of them right now. [02:39:45] andrewbogott: then do webproxy-02. is passive atm [02:39:48] ok [02:45:25] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240689 (10yuvipanda) Actually, after talking with @ori some more I think Catchpoint should play a much bigger role in measuring uptime of toollabs. I am thinking we'll exp... [02:47:53] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240691 (10yuvipanda) (using catchpoint for this might be overkill - and also its smallest resolution seems to be 5mins when I'd like this to be 1min) [02:48:06] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240692 (10yuvipanda) p:5Triage>3Normal [02:50:30] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mehdi was created, changed by Mehdi link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fMehdi edit summary: Created page with "{{Tools Access Request |Justification=A web datamining class at the university of Louisville. We are experimenting with wikipride and wikihadoop |Completed=false |User Name=Me..." [02:53:28] andrewbogott: are you still here? if so let me know when it’s done so I can switch it to being active? [02:54:07] -02 just finished. It’s on labvirt1006 now. [02:55:28] andrewbogott: cool, so I’ll switch it over now. [02:56:13] * andrewbogott nods [02:56:57] !log tools switching over webproxy active to tools-webproxy-02 per https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin [02:59:03] hmm [02:59:06] redis is taking forever to restart [03:02:52] andrewbogott: all done :) [03:03:03] andrewbogott: you can move webproxy-01 whenever you want [03:03:12] ok — I’ll just let the script take care of it if that’s ok [03:03:16] Should be done by morning [03:03:38] oh, except we want it on a different host from 02... 
[03:03:41] I’ll just move it now [03:03:43] andrewbogott: yeah. [03:03:57] andrewbogott: I think we’ll have to do some shuffling after the fact too, maybe. last I looked bastion-01 and -02 were on same host for some reason :( [03:05:23] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240699 (10yuvipanda) I could also just do this in prod icinga, fwiw. [03:11:52] YuviPanda: ok, 01 is now on labvirt1004 [03:12:00] andrewbogott: \o/ thank you [04:44:00] PROBLEM - Host tools-webgrid-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.28) [04:48:19] RECOVERY - Host tools-webgrid-01 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [05:40:25] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1240816 (10yuvipanda) @faidon @akosiaris @mark this would make labs' lives a lot simpler if we can fix this. Anyone want to take a stab / knw who can take a stab? [06:36:49] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [06:41:08] PROBLEM - Host tools-webgrid-06 is DOWN: CRITICAL - Host Unreachable (10.68.17.163) [06:46:13] RECOVERY - Host tools-webgrid-06 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [06:53:40] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [07:06:50] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:23:45] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0] [07:32:10] I have some trouble logging in to tools-login.wmflabs.org and dev.tools.wmflabs.org using puTTY/SSH. I have never had problems logging into these before. My last successful session was maybe a few weeks ago. I'm aware that the SSH certificate has changed, and was expecting a warning/request to accept the new certificate. [07:32:14] The error message I get is "Couldn't agree a client-to-server cipher (available: chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr)". Anyone with a good idea? [08:37:17] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1241053 (10valhallasw) p:5Normal>3Low Maybe Icinga for local 1-min resolution monitoring and Catchpoint for worldwide monitoring at a lower resolution? I think a 5 min... [09:09:53] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1241066 (10akosiaris) Hello, So I started having a look at this yesterday. labnet1001 does see the ICMP echo request packets on the wire but for some reason I am still researching it is not ever responding... [09:11:42] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1241072 (10hashar) @akosiaris from my past comment, could it be that NAT is not applied on the inbound interface and thus the packets are routed to the internet (since they have the public IP as a destinati... [09:35:42] 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1241091 (10akosiaris) The public IP is assigned on the outbound interface of labnet1001 so the ICMP echo requests should never be routed to the internet (and they actually are not as a tcpdump on the outbou... 
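For reference on the PuTTY cipher error reported above: the bastions now only offer CTR/GCM/ChaCha20 ciphers, so a client build that predates support for any of them cannot negotiate a session, and upgrading to a current PuTTY release is the usual first thing to try. From an OpenSSH client the mismatch can be inspected directly (a debugging sketch only; the hostname is taken from the report above):

    # ciphers the local OpenSSH client supports
    ssh -Q cipher
    # verbose handshake output shows the cipher lists both sides offered
    ssh -vv tools-login.wmflabs.org 2>&1 | grep -i 'cipher\|kex'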
[11:37:49] 10Tool-Labs, 5Patch-For-Review: Epilog scripts for web services fail with exit codes 1 and 255 - https://phabricator.wikimedia.org/T96491#1241310 (10valhallasw) 5Resolved>3Open Reopening; this is still happening. I'm wondering whether we should just return 0 from the epilog script, irrespective of success;... [13:05:27] Heads up: Later tonight I'm going to go ahead with cleaning up the Labs related projects in Wikimedia Phabricator a bit. See https://phabricator.wikimedia.org/T89270#1240017 ("first phase" only) [13:13:20] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 10 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1241473 (10valhallasw) > Question: I'm also wondering if renaming Tool-Labs to #Tool-Labs-Infrastructure might make people realize it's no... [13:33:17] Coren: here’s some terrible news: http://kvm.vger.kernel.narkive.com/5SuwYAcx/growing-qcow2-files-during-block-migration [13:33:57] * Coren facepalms. [13:34:47] So all those m1.large tools instances that were only 3-5g on the old hosts... [13:36:14] Oooo. Love the passive-agressive "thank you for using adblock" message on that page. I feel like replying "thank you for trying to make money by doing a blind mirror of a mailing list you contribute nothing to and google spamming so that your worthless mirror amongst thousands gets a higher ranking than the others" [13:38:12] andrewbogott: That's... actually an issue. But i would have expected most of those instances to have been already migrated once in the past and already have big qcows wouldn't they? [13:38:17] (But man, that sucks balls) [13:39:01] Coren: During the pmtpa-eqiad migration I did cold-migration, which was a literal scp of the vm files. So no resizing there. [13:39:12] So, yeah, some of them were already expanded, but not all. [13:39:14] Ah! Only /live/ migration. [13:39:29] And enough are expanding now that the new hosts are 80% full with labs only… well, way less than 80% migrated. [13:41:39] Well, I'm pretty sure it's possible to ensmallen qcow2 images after the fact - but that'd require shutting down the instance too. [13:42:02] oh, really? I figured that ensmallening was impossible/indeterminate. Let me look for that. [13:42:14] andrewbogott: https://unix.stackexchange.com/a/109943 [13:42:24] alternatively, create new drive, rsync, switch over / ? [13:42:25] Since a lot of tool nodes can be restarted without outage anyway… that might get us enough space. [13:43:06] valhallasw`cloud: those instructions are horrifying but maybe possible :) [13:43:07] or use software raid 1 to mirror? not sure if that's an option in practice [13:43:15] Actually probably things are zerod already. [13:43:23] andrewbogott: Most of those instructions are only useful when you delete things. [13:43:36] andrewbogott: Just doing a qcow2 -> qcow2 convert should do the trick. [13:43:56] ok, let’s try this... [13:44:39] can you drain tools-exec-11? [13:45:00] What do you think, should I try doing it in place on a running instance or is that definitely going to cause corruption? [13:45:09] I would assume that it’ll ruin the instance, except for that that link suggests it might work [13:45:16] That's definitely going to make hash browns out of the filesystem. [13:45:49] seems like, yeah [13:46:12] And, I’m converting the file that’s just called ‘disk’ I presume? [13:46:45] andrewbogott: Do a 'file disk' first to make sure that this is indeed a qcow2 image. 
:-) [13:46:52] http://linux.die.net/man/1/cpdup seems to be able to mirror a disk while running, so that might also be an option [13:47:03] disk: QEMU QCOW Image (unknown version) [13:47:25] valhallasw`cloud: At best, that'd get a time-inconsistent mirror. Also not an option imo. [13:47:51] ugh, I misread that page. I thought it would follow changes on the main drive [13:48:36] I was really hoping that ‘defraggle’ would catch on [13:49:50] !log tools -exec-11 draining [13:50:31] OK, time for daily bot-kicking... [13:51:49] andrewbogott: -11 emergency-drained. *sigh* That's still fairly disruptive, poor tools users will have gotten a lot of pain in the past two weeks. [13:52:28] oh, I didn’t even realize that draining caused disruption :( [13:52:39] Is there an alternative? Like, can we depool and wait a day? [13:52:55] Coren: Luckily I have most stuff scripted [13:53:42] andrewbogott: Waiting would help for nonrestartable jobs; but continuous jobs will unavoidably get a restart. [13:53:47] Coren: shall I shut down 11 now, or do you want to give it a bit? [13:53:58] andrewbogott: It's fully drained now. [13:54:02] ok [13:54:04] Coren: question, are the task and continuous task mixed or are there separate hosts? [13:54:24] Betacommand: They are mixed; the difference is only in the queue settings. [13:55:02] !log tools stopping tools-exec-11 for a resize experiment [13:55:08] Logged the message, dummy [13:55:20] Coren: Not sure if this would help long term, but setting a tasks host might let you de-pool it with fewer issues to use as a test platform [13:56:00] Mhm. drbd might be able to mirror the device live, but it looks too complicated to play with ;-) [13:56:41] Betacommand: Well, you'd expect normal operations to not have to depool a working exec node anyways - right now this is due to infrastructure woes. [13:57:28] Coren: Plan for the worst case, hope for the best case [13:57:30] it wouldn't really help in the case of exec nodes anyway, as they would still need to be drained (and I expect the qcow2->qcow2 conversion is pretty quick for a few gigs) [13:57:30] andrewbogott: But yeah, if you give tasks a day to drain from a node that reduces the disruption greatly (restartable tasks are supposed to be able to cope - that's why they are restartable) :-) [13:58:18] valhallasw`cloud: You can't deploy drbd after the fact anyways. :-) [13:58:58] Coren: yeah, that seems to be a general thing with anything that mirrors :-p [13:59:08] andrewbogott: The lesson here though is that it is almost certainly better to cold-migrate absolutely everything that can be done that way. [13:59:16] PROBLEM - Host tools-exec-11 is DOWN: CRITICAL - Host Unreachable (10.68.17.144) [13:59:19] yeah :( [13:59:36] live-migration works great, except in the ways that it does not work great [14:01:24] That shrank it from 157G to 13G. And I can log in. [14:02:36] RECOVERY - Host tools-exec-11 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [14:03:33] Yeah, I would have expected the size to return to what it was pre-copy; it's not going to be generally able to reclaim non-zeroed blocks, but untouched blocks will be recovered. [14:04:02] !log tools reenable -exec-11 for jobs. [14:04:05] Logged the message, Master [14:06:22] Coren: for exec nodes, is there a reason to prefer shrinking over rebuilding?
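The shrink experiment above amounts to rewriting the image through qemu-img while the instance is stopped, which drops clusters that read as all zeros. A minimal sketch of that flow, run on the hosting labvirt node; the instance directory path follows the usual nova layout and the disk.new name is just for illustration:

    cd /var/lib/nova/instances/<instance-uuid>   # usual nova instance directory (assumption)
    file disk                                    # confirm it really is a qcow2 image first
    qemu-img convert -O qcow2 disk disk.new      # re-copy, skipping all-zero clusters
    mv disk disk.orig && mv disk.new disk        # swap in the compacted image, keep a backup
    # start the instance again, then remove disk.orig once it boots cleanly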
[14:08:40] iirc they need to be added manually to the grid master, but otherwise they should be exchangeable [14:09:11] and there are tasks that require precise hosts [14:09:26] (but I think hosts can still be built with precise, right?) [14:09:33] yeah, they can [14:09:55] Hm. [14:10:06] There's the issue with the host keys which is annoying. [14:10:51] But if we wanted to be as painless as possible, the very easiest way of doing things would be to make a bunch of new nodes then drain gently all the old ones. [14:11:10] And just delete them once done. [14:11:16] Coren: the ssh host keys? that should be fixed with scfcs patch that was merged yesterday [14:11:35] https://phabricator.wikimedia.org/T92379 ? [14:12:16] valhallasw`cloud: That cleaned up old rejects, but won't prevent the new changes from being troublesome. [14:12:54] Coren: Creating new nodes and then gradually killing the old ones doens’t sound harder than the alternatives. [14:13:07] It results in -xx numbering creep, but we’re not going to run out of numbers for a while :) [14:13:49] andrewbogott: I agree, except for the fact that we'll need moar quota while doing that. [14:13:58] * andrewbogott is VERY disappointed that we can’t use live migration freely [14:14:07] andrewbogott: Also: not fast - that first puppet run on an exec node is ~1h [14:14:36] Since draining the old hosts may be a matter of a day or days, 1h for setup seems minor. [14:14:40] Well, maybe not an hour but ungodly long. [14:14:59] Yeah, it does. Sounds like the least disruptive plan to me. [14:15:19] * Coren looks at his todo list. [14:15:22] ok, I’ll write an email so Yuvi knows what’s happening. You can correct me as needed. [14:15:58] andrewbogott: Think you can do the instance creation? Should be straightforward enough and when that's done I'll do the needful in gridengine. [14:16:11] sure. [14:16:23] Is it just a single role checkbox in the puppet classes gui? [14:16:25] Also then you get to make sure they're all created on the right hosts. :-) [14:16:49] andrewbogott: Yes, but also don't forget to scroll down at the bottom to set the ssh_hba thing. :-) [14:17:08] andrewbogott: We need roughly the same number of precises too. [14:17:28] ok. I’ll make one and you can check my work. [14:17:31] kk [14:17:42] * Coren returns to finishing the replication patch. [14:17:45] When you say ‘on the right hosts’ you mean on labvirt hosts? [14:17:56] Yeah. So they don't need to be migrated. :-P [14:18:08] That would... suck. :-) [14:20:55] Ah, everything but labvirt is already depooled. [14:21:11] Ah, good. [14:24:44] tools-exec-24 (existing) doesn’t have anything set for ssh_hba [14:25:36] nor -23, nor -22... [14:26:08] RECOVERY - Host tools-login is UP: PING OK - Packet loss = 0%, RTA = 7.33 ms [14:26:45] Coren: ssh_hba? [14:27:23] andrewbogott: Yuvi musta forgotten. :-) It should be 'yes' (no quotes) [14:27:46] Doesn't break running things on the grid, but prevents users from ssh there otherwise. 
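Coren's "do the needful in gridengine" above is registering freshly built exec nodes with the grid master and draining the old ones; a sketch of the usual (open)gridengine commands, run on the master. The hostgroup name @general is illustrative, not necessarily what tools actually uses:

    # stop new jobs from landing on an old node, then let running ones finish
    qmod -d '*@tools-exec-11.eqiad.wmflabs'
    qstat -f | grep tools-exec-11                # watch until the node is empty
    # register a new node and add it to the hostgroup that backs the queues
    qconf -ae                                    # add an execution host (opens an editor template)
    qconf -aattr hostgroup hostlist tools-exec-25.eqiad.wmflabs @general
    # re-enable a node after maintenance
    qmod -e '*@tools-exec-11.eqiad.wmflabs'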
[14:33:06] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mehdi was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155866 edit summary: [14:43:21] PROBLEM - Puppet failure on tools-exec-25 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:57:25] PROBLEM - Puppet failure on tools-exec-29 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:57:55] PROBLEM - Puppet failure on tools-exec-30 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:58:11] PROBLEM - Puppet failure on tools-exec-34 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:58:13] PROBLEM - Puppet failure on tools-exec-35 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:58:15] PROBLEM - Puppet failure on tools-exec-33 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:59:07] PROBLEM - Puppet failure on tools-exec-27 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:59:09] PROBLEM - Puppet failure on tools-exec-32 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:59:35] PROBLEM - Puppet failure on tools-exec-39 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:04:44] PROBLEM - Puppet failure on tools-exec-40 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:04:50] PROBLEM - Puppet failure on tools-exec-31 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:08:10] PROBLEM - Puppet failure on tools-exec-37 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:09:17] Coren: hba should be in puppet... [16:09:29] * YuviPanda hates manual steps [16:09:58] YuviPanda: It should be in hiera actually - because that's a parameter to something included from base [16:10:08] YuviPanda: No hiera back then. [16:10:26] Coren: we could’ve always set the global in the exec node role [16:10:33] we already set motd that way [16:12:14] * Coren says nasty things about using python for the sake of using python. [16:12:19] 10Tool-Labs, 5Patch-For-Review: Epilog scripts for web services fail with exit codes 1 and 255 - https://phabricator.wikimedia.org/T96491#1241934 (10yuvipanda) 5Open>3Resolved Yeah, this happened when the webproxy was being failed over yesterday night to help with migration to new labs instances. The curre... [16:12:27] RECOVERY - Puppet failure on tools-exec-29 is OK: OK: Less than 1.00% above the threshold [0.0] [16:12:44] Coren: you should say them to the people who -1’d the patch :) [16:12:58] RECOVERY - Puppet failure on tools-exec-30 is OK: OK: Less than 1.00% above the threshold [0.0] [16:13:10] PROBLEM - Puppet failure on tools-exec-38 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:13:12] RECOVERY - Puppet failure on tools-exec-34 is OK: OK: Less than 1.00% above the threshold [0.0] [16:13:16] YuviPanda: Not a battle worth fighting. [16:13:27] heh :) [16:13:43] It makes the scripts about thrice as large and complicated, but it's "simpler" that way. :-) [16:14:18] Coren: I think it’s probably simpler if amortized over everyone else who has to touch it. [16:14:36] anyway, that discussion is going to go nowhere so let’s not have it :) [16:14:39] YuviPanda: Not a battle worth fighting. [16:14:41] ^^ [16:14:49] Coren: when re-createing exec nodes are you using a numbering system of some sort? 
[16:14:54] to differentiate precise and trusty? [16:14:57] It's Andrew doing it atm. [16:15:03] hi andrewbogott [16:15:10] I don't think so. [16:15:13] ah [16:15:15] we should [16:15:16] YuviPanda: I’m in a meeting atm [16:15:18] ok [16:15:23] * YuviPanda checks [16:15:26] I didn’t do anything with numbering but I did create some of each [16:15:46] hmm [16:15:55] andrewbogott: just creating them also doesn’t set swap properly. [16:16:08] swap? [16:16:08] Coren: what formula were you using to set swap in the ones you did by hand? 2x RAM? [16:16:23] andrewbogott: yup. https://phabricator.wikimedia.org/T95979 [16:16:44] Coren: we should fix that in one go before recreating all of them because then we’ve to fix them all manually and AARGH PAIN [16:17:21] Coren: also can’t we just increase VMEM to 3x RAM instead of having to create Swap? Everything is dead if it hits swap anyway [16:17:26] YuviPanda: nothing I’ve done is very far along, want me to just delete everything I made and leave this to you? [16:17:33] andrewbogott: yeah... [16:17:48] YuviPanda: Everything grinding to molasses >>> anything being killed by the OOM killer [16:17:53] ok then! [16:18:00] Coren: idk, if it gets killed it gets rescheduled [16:18:13] RECOVERY - Puppet failure on tools-exec-35 is OK: OK: Less than 1.00% above the threshold [0.0] [16:18:13] YuviPanda: OOM killer is a -9. It's evil in the best of cases. [16:18:17] RECOVERY - Puppet failure on tools-exec-33 is OK: OK: Less than 1.00% above the threshold [0.0] [16:18:38] And OOM killer can do things like kill the shepherds, or even the sge exec. [16:18:45] oooh that’s right [16:18:51] Coren: so, 2x RAM? [16:19:07] RECOVERY - Puppet failure on tools-exec-27 is OK: OK: Less than 1.00% above the threshold [0.0] [16:19:15] we should puppetize that. or at least put that code into a script [16:19:27] IIRC, I went 3x ram [16:19:32] YuviPanda: done. The quotas are raised so you can create ~16 xlarges. [16:19:37] andrewbogott: ty [16:19:49] PROBLEM - Host tools-exec-40 is DOWN: CRITICAL - Host Unreachable (10.68.18.12) [16:20:02] 3x gave the right balance of letting processes commit to the real ram ammount but not touch swap in the general case. [16:20:29] PROBLEM - Host tools-exec-30 is DOWN: CRITICAL - Host Unreachable (10.68.17.88) [16:20:36] alright [16:20:42] It's not a critical tuning bit, so long as the h_vmem actually set on the node is equal to (ram+swap-1G) [16:20:54] Coren: can oyu comment on https://phabricator.wikimedia.org/T95979 for posterity, and I can get puppetize the swap creation itself? 
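The swap sizing discussed above (roughly 3x RAM, with the node's h_vmem set to ram + swap - 1G) was done by hand on the existing nodes using an LVM volume as swap. A sketch of what puppetizing it would reproduce, for an 8G-RAM node; the volume group name vd is an assumption:

    lvcreate -L 24G -n swap vd                   # 3x the RAM of an 8G instance
    mkswap /dev/vd/swap
    swapon /dev/vd/swap
    echo '/dev/vd/swap none swap sw 0 0' >> /etc/fstab
    # with 8G RAM + 24G swap, h_vmem for this host would then be set to 31G (ram + swap - 1G)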
[16:21:11] PROBLEM - Host tools-exec-39 is DOWN: CRITICAL - Host Unreachable (10.68.18.13) [16:21:16] PROBLEM - Host tools-exec-33 is DOWN: CRITICAL - Host Unreachable (10.68.17.194) [16:21:35] PROBLEM - Host tools-login is DOWN: CRITICAL - Host Unreachable (10.68.16.7) [16:21:38] PROBLEM - Host tools-exec-31 is DOWN: CRITICAL - Host Unreachable (10.68.17.105) [16:21:50] PROBLEM - Host tools-exec-25 is DOWN: CRITICAL - Host Unreachable (10.68.16.7) [16:22:04] PROBLEM - Host tools-exec-28 is DOWN: CRITICAL - Host Unreachable (10.68.16.151) [16:22:08] PROBLEM - Host tools-exec-38 is DOWN: CRITICAL - Host Unreachable (10.68.17.239) [16:22:20] PROBLEM - Host tools-exec-37 is DOWN: CRITICAL - Host Unreachable (10.68.17.202) [16:22:21] PROBLEM - Host tools-exec-35 is DOWN: CRITICAL - Host Unreachable (10.68.17.197) [16:22:55] PROBLEM - Host tools-exec-32 is DOWN: CRITICAL - Host Unreachable (10.68.17.119) [16:22:55] PROBLEM - Host tools-exec-36 is DOWN: CRITICAL - Host Unreachable (10.68.18.14) [16:22:55] PROBLEM - Host tools-exec-26 is DOWN: CRITICAL - Host Unreachable (10.68.16.57) [16:22:57] 10Tool-Labs, 3ToolLabs-Goals-Q4: Harmonize VMEM available on all exec hosts - https://phabricator.wikimedia.org/T95979#1241971 (10coren) As a rule, swap = 3xRAM has proven to be a good balance of letting the jobs consume a good fraction of the ram without ever hitting swap (because h_vmem causes overcommit of... [16:23:03] YuviPanda: Another reason to not let the OOM wake is that it can kill process X as a consequence of process Y eating all the ram. :-) [16:23:11] yup [16:23:33] (Honestly, the only worse thing than the OOM waking on a live box is a kernel panic) [16:23:42] PROBLEM - Host tools-exec-34 is DOWN: CRITICAL - Host Unreachable (10.68.17.147) [16:23:52] PROBLEM - Host tools-exec-29 is DOWN: CRITICAL - Host Unreachable (10.68.16.222) [16:25:11] PROBLEM - Host tools-exec-27 is DOWN: CRITICAL - Host Unreachable (10.68.16.60) [16:25:11] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3Labs-Q4-Sprint-4, and 2 others: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1241993 (10greg) [16:26:00] andrewbogott: when off meeting can you file a bug for the re-creations? is this going to affect deployment-prep? [16:26:20] YuviPanda: I’m just doing tools (and mostly just exec nodes) for starters. [16:26:25] Then we’ll see what the space situation is [16:26:29] alright [16:26:44] andrewbogott: I added other toollabs admins on the email [16:27:11] but first, breakfast and shower. [16:36:35] Coren: so in the current instances, you basically have an lvm volume that’s then set as swap, is that right? [16:36:58] Basically. [16:37:08] ok [16:37:48] 10Tool-Labs, 3ToolLabs-Goals-Q4: Harmonize VMEM available on all exec hosts - https://phabricator.wikimedia.org/T95979#1242034 (10yuvipanda) p:5Low>3Normal [16:48:15] o/ YuviPanda [16:48:29] Shilad is having trouble using that Wikibrain project. [16:48:51] Could I bother you to take a look to make sure he's admin on it? Or should I file a ticket? 
[16:50:39] 10Wikimedia-Labs-wikitech-interface: Cleanup and enable UserFunctions extension on wikitech - https://phabricator.wikimedia.org/T47455#1242069 (10Aklapper) [16:51:28] 10Wikimedia-Labs-wikitech-interface: Incorrectly attributed edits (from 2006) to me - https://phabricator.wikimedia.org/T59346#1242071 (10Aklapper) p:5Triage>3Lowest [16:55:42] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-DNS, 10Wikimedia-Extension-setup, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#1242075 (10Aklapper) p:5Triage>3Low [16:55:44] 10Wikimedia-Labs-wikitech-interface: Project Bastion has service groups - https://phabricator.wikimedia.org/T64537#1242077 (10Aklapper) p:5Triage>3Low [16:55:52] 10Wikimedia-Labs-wikitech-interface: Remove Puppet class generic::packages::git-core and replace misc::package-builder with role::package::builder::labs - https://phabricator.wikimedia.org/T71135#1242079 (10Aklapper) p:5Triage>3Low [16:56:09] 6Labs, 10Wikimedia-Labs-wikitech-interface, 7Monitoring: Monitor for wikitech logins failing - https://phabricator.wikimedia.org/T91226#1242081 (10Aklapper) p:5Triage>3Normal [17:11:51] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1242135 (10Shilad) Hello! I believe I listed the wrong Wikitech username. It should be "Shilad Sen" instead of just "Shilad". Are you able to change this? Sorry for the mistake! [17:14:35] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1242167 (10coren) [17:14:37] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3Labs-Q4-Sprint-4, and 2 others: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1242165 (10coren) 5Open>3Resolved This is now done for all instances that were alive at all - given the ample warnings to the mailing li... [17:14:39] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1242168 (10coren) [17:15:05] YuviPanda: Quick +1 to https://gerrit.wikimedia.org/r/#/c/203864/ ? [17:15:37] Just so I can close the ticket - this is just post-action cleanup. [17:15:54] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1242170 (10Andrew) 3NEW a:3yuvipanda [17:16:04] YuviPanda: is that what you needed? ^ [17:20:23] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-DNS, 10Wikimedia-Extension-setup, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#1242194 (10Dzahn) Since the DNS part is done i'm going to remove that project tag. wikitech.m.wikimedia.org has address 208.80.154.236 w... [17:20:35] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-Extension-setup, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#1242195 (10Dzahn) [17:21:10] halfak: Shilad still having trouble? I can look. [17:21:26] andrewbogott, last I heard, yes. [17:21:35] It seems he doesn't see "manage instances" on the left. [17:21:51] Also, when he goes to NovaInstance (or whatever) he doesn't see wikibrain listed. [17:22:03] just tools/bastion [17:22:47] Yes, some other Shilad was added :) [17:23:50] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1242223 (10Andrew) I've removed user 'Shilad' and added 'Shilad Sen.' Hope that helps! 
[17:26:03] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-Extension-setup, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#1242240 (10Krenair) Looks like MobileFrontend is enabled on the site now as well... I guess we need some apache config changes to point wikitech.m.wm.o to t... [17:35:10] YuviPanda: remaining unmigrated tools instances are: tools-webgrid-03, tools-webgrid-08, tools-webgrid-generic-02, tools-webgrid-generic-01. Can you advise about which I should live-migrate (wastes space) vs. which I should cold-migrate (requires reboot) vs. which you want to just rebuild? [17:36:05] andrewbogott: sadly you’ve to live-migrate them all or rebuild them all, whichever you prefer. we can cold-migrate them with draining if you want tho [17:37:00] rebuilding requires us to drain as well, right? [17:37:10] So may as well just cold-migrate unless rebuilding is trivial. [17:37:59] andrewbogott: yeah, we can cold migrate. [17:38:09] Coren: ^ can you help drain / move around? [17:38:25] these should be faster because we can just qdel them and let the service manifests take care of them [17:38:33] YuviPanda, Coren No rush, just let me know which ones to move and when. [17:39:35] Coren: 30 minutes to 3hrs :P [17:39:48] (03CR) 10Jforrester: "FIXME: This wasn't rebased. :-(" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 (owner: 10Awight) [17:39:53] Betacommand: Yuvi-borne paranoia. I fully expect the actual downtime to be 2-3 mins. :-) [17:40:49] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1242331 (10coren) [17:41:11] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1242341 (10coren) [17:41:13] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1242339 (10coren) 5Open>3Resolved Purged idmap config from instances. With that, this is done. [17:41:14] YuviPanda: Sure. [17:42:05] Coren: thanks [17:42:16] 6Labs, 10Gerrit-Migration, 6Phabricator: Stabilize vcs-user owned files and directories in Phab-02 - https://phabricator.wikimedia.org/T95982#1242353 (10coren) Given that only one instance uses this for the moment, this does not block turning off idmap. [17:44:26] (03CR) 10Jforrester: "Hmm, actually, no, but the tool isn't using this config version?" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 (owner: 10Awight) [17:44:42] !log tools -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained [17:44:53] Logged the message, Master [17:44:53] andrewbogott: ^^ you can do all but -webgrid-generic-02 [17:45:24] Coren: cool. [17:45:37] Coren… check out https://rhn.redhat.com/errata/RHBA-2013-0543.html and search for ‘qcow2’ [17:45:57] It seems like someone, somewhere, fixed this issue. I can’t make any sense out of the version numbers though. 
[17:46:20] but even if they did, we can’t recompress them can we [17:46:21] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1242389 (10coren) [17:46:24] 6Labs, 10Gerrit-Migration, 6Phabricator: Stabilize vcs-user owned files and directories in Phab-02 - https://phabricator.wikimedia.org/T95982#1242390 (10coren) [17:46:36] YuviPanda: no, we can’t, but there are still hundreds of instances that need migrating. [17:46:42] andrewbogott: true [17:48:16] andrewbogott: That presumes that whatever that block_stream thing is is actually what is used by live migration? [17:48:21] andrewbogott: that's https://bugzilla.redhat.com/show_bug.cgi?id=832336 [17:48:34] (but you probably had found that one already) [17:48:37] I’m using the —block-migration flag… [17:51:23] andrewbogott: the bug says qemu > 1.1 should have that bug fix, but that's an ancient release (then again, it's a 2012 bug) [17:51:42] right, I’m running 2.0.0 [17:51:45] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1242445 (10Shilad) Perfect. Thanks! [17:54:18] valhallasw`cloud: I guess I’ll file a bug w/qemu and see what they say [17:54:29] PROBLEM - Host tools-webgrid-03 is DOWN: CRITICAL - Host Unreachable (10.68.17.123) [17:55:39] RECOVERY - Host tools-webgrid-03 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [17:57:43] PROBLEM - Host tools-webgrid-generic-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.152) [18:00:41] PROBLEM - Host tools-webgrid-08 is DOWN: CRITICAL - Host Unreachable (10.68.18.35) [18:01:22] *nod* there's some stuff on migrating block devices on the qemu wiki, but it's all on 'future plans' wiki pages [18:01:52] so, well, hopefully the developers have an idea what's there and what isn't ;-) [18:03:09] RECOVERY - Host tools-webgrid-generic-01 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [18:03:12] Coren: ok, I’ve moved all but tools-webgrid-generic-02 [18:03:34] Lemme reenable them and drain the other [18:05:12] !log tools reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02 [18:05:16] Logged the message, Master [18:05:21] andrewbogott: All set. [18:05:37] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [18:05:43] RECOVERY - Host tools-webgrid-08 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [18:05:46] Coren: ready for me to move 02? [18:05:59] * Coren nods. [18:07:50] andrewbogott, sorry to run off. meetings :/. So did you get the right shilad added to wikibrain? [18:08:19] halfak: no worries, seems to be fixed. [18:08:25] kk Thanks [18:11:29] Coren: ok, -02 is moved. [18:11:44] !log tools reenabled -webgrid-generic-02 [18:11:47] Logged the message, Master [18:12:08] man, copying these things is /way/ faster when I don’t have to move all the empty space [18:14:25] Coren, YuviPanda: I’m heading to lunch. I’ll send an email with an updated-and-revised-migration-plan later today. [18:14:40] andrewbogott.getlunch() [18:15:10] tools is now fully running on the new hardware. which is both good and terrible news. [18:17:25] <^d> I just killed 2 unused xlarge instances, go me :D [18:30:39] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [18:34:12] YuviPanda, hey dude what's the status of that shared postgres instance. [18:34:14] ^? 
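On the version question above (whether the qcow2 block-migration fix from that erratum is in the qemu actually running on the labvirt hosts), checking what is installed versus what a running guest was started with is straightforward; a sketch, assuming Ubuntu-packaged qemu/libvirt:

    dpkg -l | grep -E 'qemu|libvirt'             # packaged versions on the host
    qemu-system-x86_64 --version                 # version of the emulator binary itself
    virsh version                                # what libvirt reports for the hypervisor
    # running guests keep the qemu they were started with until restarted or migrated:
    ps -eo args | grep '[q]emu-system' | head -1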
[18:36:35] I'm getting an error when I try to do anything with pip in my virtualenvs on Labs: "ImportError: No module named _io" [18:36:39] has anyone else seen this? [18:39:30] but a fresh virtualenv runs as expected. [18:43:56] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-Extension-setup, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#1242749 (10MaxSem) >>! In T87633#1242240, @Krenair wrote: > Looks like MobileFrontend is enabled on the site now as well... I guess we need some apache conf... [18:44:18] fhocutt: that's probably caused by running an old virtualenv on a more recent OS [18:44:44] thanks, valhallasw`cloud [18:44:49] fhocutt: (most tool labs servers have been upgraded from ubuntu 12.04 to 14.04 in the last months) [18:44:55] ah right [18:45:01] fhocutt: solution is to re-run virtualenv for that venv [18:45:07] great, will do that [18:45:18] i.e. if your venv is in ~/something, just go to ~ and run virtualenv something [18:45:28] will it nuke the modules I have installed or just touch the virtualenv itself? [18:45:37] no, should just update the venv itself [18:46:41] or try to, it seems [18:46:54] OSError: Command /data/project/hostbot/matchbot/bin/python -c "import sys, pip; sys...d\"] + sys.argv[1:]))" setuptools pip failed with error code 1 [18:47:05] OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_virtualenv.51322.crash' [19:03:04] about to run for a meeting but [19:03:33] fhocutt: valhallasw`cloud you should also do ‘-l release=trusty’ for jsub related things, and make sure your instance is running on trusty for webservices (webservice —release trusty start) [19:03:40] the second will become default tonight [19:03:58] YuviPanda: yeah, but which bastion host is still 12.02? [19:04:03] valhallasw`cloud: uh, none... [19:04:12] valhallasw`cloud: should create a tools-precise [19:04:20] YuviPanda, Coren, car was towed so my return (and my lunch) may be considerably delayed. [19:04:27] fhocutt: hum. did you deactivate the venv first? [19:04:28] valhallasw`cloud: if you want one in a pinch tools-dev.eqiad.wmflabs (not .wmflabs.org) [19:04:32] I don’t even own a car, and yet… towed. [19:04:38] ouch [19:15:01] ah! no, I didn't. [19:15:26] same thing there [19:19:17] fhocutt: strangely enough, the upgrade does seem to have worked [19:19:32] fhocutt: if you try to import _io in that venv python, it works [19:19:34] weiiiiiird [19:21:13] weird! [19:21:19] thanks, valhallasw`cloud [19:26:19] andrewbogott_afk: YuviPanda: and in the meantime I'm fighting a (minor) flood in my basement. Yeay spring. [19:40:03] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1243022 (10valhallasw) p:5Triage>3High I'll still mark this as 'High', as this should happen sooner than most bugs that are 'Normal' priority. [19:40:15] Thankfully caught in time unlike last year - no underwater power bricks this time. :-) [19:40:25] 10Tool-Labs, 5Patch-For-Review: Create separate partition for /tmp on toollabs exec / web nodes - https://phabricator.wikimedia.org/T97445#1243032 (10valhallasw) p:5Triage>3Normal [19:41:50] good evening [19:42:38] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1243038 (10yuvipanda) Yup. Let's just fix the disk layout as we go as well. 
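Pulling together the fixes discussed above for tools built on precise and now running on trusty hosts: rebuild the virtualenv against the new system python, and explicitly pin grid jobs and webservices to trusty. A sketch using the paths and commands from the conversation; the job script name is hypothetical:

    # rebuild an existing venv in place (deactivate it first if it is active)
    virtualenv ~/something                       # or the tool's venv path
    # request trusty for grid jobs and for the webservice
    jsub -l release=trusty ./mybot.sh
    webservice --release trusty start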
[19:42:55] o/ hashar [19:43:43] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1243040 (10yuvipanda) I propose we make them all xlarge and have 10 for precise and 5 for trusty (exec nodes). Webgrid can come later. [19:44:38] Coren: YuviPanda: any clue how labs_bootstrapvz/ and labs_vmbuilder/ are being used to create the labs images ? [19:45:09] for CI I need to create some new images that will come with a bunch of packages pre installed to save on boot time [19:45:20] YuviPanda: https://gerrit.wikimedia.org/r/#/c/205915/ can be abandoned? [19:45:24] hashar: I know how it used to work, but there have been changes done by andrewbogott_afk. I'm sure he can help you once he catches his car. [19:45:59] luckily there are some docs on the wiki!!! [19:46:12] https://wikitech.wikimedia.org/wiki/OpenStack#Building_new_images [19:46:14] mind you [19:46:17] I was not expecting any doc! [19:48:09] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1243053 (10coren) It's not clear to me that fewer large instances is better than more smaller instances. Because of virtualization we don't "pay" for idle cycles, and small instances are easier to balance, faster... [19:54:25] seems you have disk space issues on virt nodes :/ [19:54:57] hashar: Yeah, live-migration had an unexpected side effect - it filled the (nominally) copy on write disks up. [19:55:12] So instances that used to take 15G on disk now take 160 [19:55:15] "yeay" [19:56:06] Made even more happy extra fun because the stupid thing actually spend network bandwidth copying those 145G of zeros. :-) [19:56:19] ahhh [19:56:31] so moving the instance from another triggers the copy right ? [19:56:42] I guess because it uses hardlinks [19:58:05] 10Tool-Labs: tools-redis broken - https://phabricator.wikimedia.org/T96485#1243092 (10yuvipanda) So I actually *enabled* overcommit - redis can't actually work properly without it - http://redis.io/topics/faq ctrl-f 'overcommit'. [19:59:21] Moving them from host to host while up does; because then it tries to be smart about keeping up with the changing disk and - apparently - does so from the virtual side (that doesn't know about holes in the image). Cold migration just copies the actual image file so doesn't have that issue. [20:05:21] there are so many different wrappers to build images [20:05:23] that is messy [20:06:57] hashar, when you said “seems you have disk space issues on virt nodes” was that because of an alert someplace? [20:07:24] andrewbogott: na I was reading some random task where yuvi said the virt nodes had little room left to accomodate more instances [20:07:28] due to disk constraints [20:07:52] andrewbogott: will it help if I rebuilt about, say, 6 instances right now? [20:07:54] ah on https://phabricator.wikimedia.org/T97437 [20:07:58] I can do those without any user facing issues [20:08:05] (non-exec nodes) [20:08:21] hashar: yeah, a bunch of migrated instances grew by 15x during migration. Pretty serious! [20:08:32] I didn’t notice until I did a bunch of xlarge tools nodes. [20:08:34] do you have a tracking task listing them ? [20:08:43] on integration, most of them can easily be recreated [20:08:49] on deployment-prep that depends :] [20:09:22] andrewbogott: no way to ‘recompress’ them? [20:09:32] hashar: if you are going to build new jessie images you should probably let me go first. 
I’m following some bootstrap-vz bugs and things might have broken since I last used it [20:09:39] hashar: due to jessie going gold [20:09:47] YuviPanda: yes, if we shut them down I can recompress. [20:10:03] andrewbogott: w00t, then maybe we can just do that... [20:10:37] Sure, it’s just a question of what’s less painful. With exec nodes… the user impact is identical, might as well get everything spruced up in the meantime. [20:10:40] Since we’re not in a rush [20:11:01] andrewbogott: yeah, cool - so let’s recreate the user ones but do this for some others? [20:11:07] andrewbogott: can you try on tools-services-01? [20:11:26] YuviPanda: is it safe to stop it? [20:11:33] andrewbogott: yup, is a hotspare. [20:11:42] ok, let’s see if I remember [20:12:43] Rebuilding some of the exec nodes would not be a bad thing. Tools-exec-01 is almost exactly as old as tools-login was. [20:13:05] andrewbogott: yeah I noticed some issue [20:13:15] Coren: yeah, I think we should puppetize /tmp and the swap and then rebuild them all to be consistently sized (right now they aren’t) [20:13:32] andrewbogott: I looked at the way images are created. Looks like that requires to be done on an instance to grab various config file from the instance build host [20:13:58] andrewbogott: since we expect resolv.conf / ldap.conf etc to be provided by a puppet run on wmflabs and "just" copy them to the image [20:14:28] I agree; but I'd do twice as many larges rather than xlarges [20:15:19] Coren: yeah, I’m ok with that too [20:15:32] Hah! YuviPanda, I just compressed tools-services-01 without looking first… it was small to begin with. Probably started out on a labvirt box in the first place. [20:15:42] Anyway, it’s still a validish test — is it running ok now? [20:15:55] andrewbogott: yup [20:16:04] andrewbogott: am pretty sure it started on a virt* box [20:18:34] andrewbogott: hmm, actualaly I can’t seem to ssh in? [20:19:10] !ping [20:19:10] !pong [20:19:10] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:15] err [20:19:29] andrewbogott: is this another CPU stall or network issue? I can’t ssh to anywhere... [20:20:05] Coren: ^ [20:20:10] hmm [20:20:12] seems to have resolved itself [20:20:43] I was about to say that I see nothing going on... [20:20:44] I still can’t ssh into tools-services-01 tho [20:21:13] I also see nothing untowards on the NFS graphs. [20:21:39] YuviPanda: didn’t you just say that it was fine…? [20:21:43] That's odd - I just logged in and it was basically instant. [20:21:48] andrewbogott: yeah back to fine again [20:21:56] might be my local network? [20:21:58] tools-services-01? [20:22:00] fine? [20:22:07] yeah [20:22:08] lgtm [20:22:16] lgtm me too [20:22:17] ok [20:22:21] Um… https://wikitech.wikimedia.org/wiki/OpenStack#Recompress_a_live-migrated_instance [20:23:10] can we start an etherpad for tools instances and figure out which ones we’ll recompress and which ones we will rebuild? [20:23:14] andrewbogott: I would have been surprised if you had been the first bitten by that misfeature. [20:23:28] (https://etherpad.wikimedia.org/p/tools-the-great-recompress) [20:23:37] Coren: I just wrote that a minute ago. [20:23:47] Oh duh. [20:23:54] * Coren blushes, and shuts up. [20:24:02] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 759312 bytes in 2.009 second response time [20:24:05] the magic of wiki! [20:24:09] Migrating is weird about base images. It might be that sometimes it does it right, if it finds the right base? 
In theory, though, every base image is avaialable everywhere. [20:24:38] hashar: that’s right, the base image is essentially built out of an existing working instance. [20:24:41] andrewbogott: might have been a network hiccup? shinken-wm was complaining at the same time I as having issues [20:24:48] With identifying info blanked out. [20:25:14] andrewbogott: I was expecting something more complicated such as running our puppet manifest in the debootstrap chroot :] [20:25:28] andrewbogott: I’m failing over tools-services-02 (current active) to -01, and let me try recompressing that [20:25:36] andrewbogott: so for the ci isolation that would have been all about adding my puppet role class and bahm new image! [20:25:40] andrewbogott: or should I leave recompressing to you? :) [20:25:51] * YuviPanda adds more things to andrewbogott’s plat [20:26:19] YuviPanda: I can’t emphasize enough how much you have to stop an instance before following those directions :) [20:26:30] So you’ll need a session on virt1000 and a session on labvirt [20:26:32] !log tools failed over tools-services to services-01 [20:26:35] oh [20:26:39] Logged the message, Master [20:26:48] andrewbogott: can I just co-ordinate with you and put the actual recompression on your plate? :D [20:26:55] Oh, sure, that’s fine. [20:27:13] hashar: yeah, should be straightforward, you can make your own image-building instance. [20:27:26] YuviPanda: scroll down [20:27:41] andrewbogott: oh, under ‘done’? [20:27:45] My big paste has IDs which will make this much easier [20:27:51] aaaah [20:27:52] cool [20:28:39] Yeah, let’s only have one entry per instance so we don’t do two different things to any of ‘em [20:28:58] yeah I cleared mine out [20:29:18] andrewbogott: can you !log here as you recompress? [20:29:23] sure [20:29:30] valhallasw`cloud: think we can use this opportunity to move mail to trusty? :) [20:29:53] YuviPanda: euh, I guess? [20:30:37] andrewbogott: yeah I think I will end up doing that and build a Trusty one [20:31:01] andrewbogott: I wanted to reuse OpenStack diskimage-builder utility (which is pretty nice and used by nodepool) but that is going to take too long [20:31:03] valhallasw`cloud: want to give it a shot or should I? :) [20:31:17] YuviPanda: let me see if I can break stuff [20:31:43] hashar: Do you need some kind of automated image-building integration, or just one starter image? [20:31:45] andrewbogott: ok, so tools-services-02 and tools-webproxy-02 are good to go. [20:31:50] I don’t know anything about automating our build tools. [20:31:55] andrewbogott: for now a starter image [20:31:56] valhallasw`cloud: ok. [20:32:08] hashar: ok, then you should definitely use our tools, or just ask me to make one :) [20:32:09] YuviPanda: or should it become jessie while we're at it? [20:32:10] andrewbogott: and middle term a fully integrated system to easily build them as needed :] [20:32:18] valhallasw`cloud: nope, let’s just do trusty. [20:32:48] valhallasw`cloud: precise + trusty mix is bad enough :D also, gridengine doesn’t have a jessie package so I’m not too gung ho about moving to there [20:32:56] valhallasw`cloud: also call it tools-mail-01 and then we can have tools-mail-02 :D [20:33:11] andrewbogott: ideally we would not need to use an instance context to build an image. That should be in some kind of manifests of the building image utility. Also using two different system is awkward :] No complaint there, just stating. [20:33:11] I think we should be more creative in our naming [20:33:24] valhallasw`cloud: like what? 
[20:33:31] hashar: if you want an automated system you should probably use what OS is using for that. [20:33:38] valhallasw`cloud: let’s not start calling them names and getting attached to them :D [20:33:59] :( I think https://en.wikipedia.org/wiki/List_of_flying_mythological_creatures would be well-suited for mail servers :p [20:34:08] valhallasw`cloud: no :P [20:34:18] YuviPanda: did you decide not to use the etherpad? [20:34:19] andrewbogott: yeah I already took a look at their python software diskimage-builder , it is well thought and easy to hook custom recipes. [20:34:21] pft. lack of creativity here [20:34:50] andrewbogott: oh, I thought anything under *Done was done :D [20:35:03] YuviPanda: heh, sorry :) [20:35:04] andrewbogott: but I will have to spend some time figuring out how to setup our puppet repo, apply our base labs::instance class with the various ec2id / $::realm etc context. Not an easy task [20:35:14] YuviPanda: but that’s a better idea, let me rearrange [20:35:14] YuviPanda: or tools-mailrelay-01? [20:35:23] andrewbogott: yeah :) [20:35:26] YuviPanda: that's what the puppet class is called after all [20:35:28] andrewbogott: I put ‘good to go’ under 3 instances [20:35:30] valhallasw`cloud: +1 sure [20:35:43] valhallasw`cloud: tools-mail-relay-01? or mailrelay? [20:35:45] either works for me [20:35:52] puppetclass is mailrelay [20:35:58] sure then. [20:36:03] let’s go with that :) [20:37:19] What SQL query will get me the intersection of page IDs that are in both Category A and Category B? [20:37:21] YuviPanda: ok, how’s that? [20:37:34] andrewbogott: much better! [20:37:37] andrewbogott: you got the wrong webproxy tho [20:37:49] fixed [20:37:52] I was just going to say: you should doublecheck the ‘ready to be compressed’ section :) [20:38:00] ok, it’s right now? [20:38:12] andrewbogott: as a rule of thumb, anything with a 208.* actual public IP shouldn’t be compressed. [20:38:15] we’ll switch over before [20:38:17] andrewbogott: yes, it’s right now [20:38:20] ok [20:38:55] valhallasw`cloud: do you want to do it or should I take a shot? [20:39:06] YuviPanda: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000bac.eqiad.wmflabs [20:39:23] !log tools created tools-mailrelay-01 [[Nova_Resource:I-00000bac.eqiad.wmflabs]] [20:39:25] valhallasw`cloud: \o/ sweet. [20:39:28] Logged the message, Master [20:39:49] puppet classes are not right, though. let me see... [20:39:55] ~log tools stopping, shrinking, restarting tools-services-01 [20:40:14] :D [20:41:02] YuviPanda: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000d1.eqiad.wmflabs lists base, role::labs::instance, exim::simple-mail-sender, sudo::labs_project, role::labs::tools::mailrelay, role::labs::lvm::biglogs [20:41:09] YuviPanda: you should let me move things to ‘done’ otherwise I’ll get confused. [20:41:19] andrewbogott: I didn’t move anything [20:41:24] I can't find exim::simple-mail-sender, sudo::labs_project and role::labs::lvm::biglogs [20:41:32] ok, then, /I/ should let myself do that [20:41:36] andrewbogott: :) [20:41:38] and base and role::labs::instance are selected but don't show up at https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000bac.eqiad.wmflabs [20:41:47] valhallasw`cloud: base and role::labs::instance are default [20:41:57] valhallasw`cloud: also don’t trust Nova_Resource. 
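The category-intersection question above (page IDs in both Category A and Category B) is usually answered with a self-join on categorylinks; a sketch against the enwiki replica from a tool account. The host, database and credentials file follow the usual Tool Labs conventions (adjust for the wiki in question), and category names take underscores with no "Category:" prefix:

    mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.labsdb enwiki_p -e "
      SELECT a.cl_from AS page_id
      FROM categorylinks AS a
      JOIN categorylinks AS b ON b.cl_from = a.cl_from
      WHERE a.cl_to = 'Category_A'
        AND b.cl_to = 'Category_B';"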
always trust Special:NovaInstance [20:42:00] oh, maybe I should just be patient, base and role::labs::instance are there now [20:43:02] hurray apt-get errors [20:43:17] :) [20:43:22] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=tools&instanceid=7f99aa1d-4ee3-4256-9b70-871271501600®ion=eqiad [20:43:31] that's not supposed to happen, right? :P [20:43:33] !log tools stopping, shrinking, starting tools-static-02 [20:43:39] Logged the message, dummy [20:43:51] valhallasw`cloud: I’ve never looked at the console output ever :D always just ssh in.. [20:45:00] YuviPanda: uh, okay [20:46:03] I also don't get why puppet happily continues even though everything is breaking [20:46:15] sometimes it takes a few runs to ‘get right' [20:46:46] ... really? that's just horrible [20:46:59] we can fix it but nobody’s bothered. [20:47:00] PROBLEM - Host tools-static-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.32) [20:47:13] it's probably puppets random ordering :P [20:47:14] for example: things in our local repository don’t work until second puppet run [20:47:18] not really. [20:47:25] we can fix it if we want to - we just have to order them explicitly [20:47:29] *nod* [20:47:40] needing a few runs is a price to pay for not doing work :D [20:47:46] first fix sources.list, then update, then install packages [20:48:03] yeah but apt-get update is not in control of puppet but is run by puppet-run which is outside puppet :D [20:48:49] ...what? [20:48:50] andrewbogott: oh, tools-webproxy-01 is done? [20:48:55] I don't even want to know :P [20:49:03] this is like recursive make [20:49:09] 'just run make a few times, it'll work... eventually' [20:49:32] YuviPanda: I… think so? I can double check [20:49:42] valhallasw`cloud: https://gerrit.wikimedia.org/r/#/c/179082/ for (very good) reason [20:49:44] RECOVERY - Host tools-static-02 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [20:49:44] I'm having trouble sshing into a new instance for which I'm a project admin. See my ssh config here. https://gist.github.com/halfak/c5ffed354c67552aebe7 I can ssh to other instances just fine. [20:49:52] Instance is wikibrain-big [20:50:18] I'll update the gist with an "ssh -vv" in a moment [20:50:24] YuviPanda: I see [20:50:38] andrewbogott: ah ok, I saw that it moved to Done but didnt’ see a !log so was wondering :) also I am adding more things in the ‘ready to compress’ list [20:50:44] * halfak waits for it to time out [20:50:52] YuviPanda: but what sets sources.list then? that should also be a layer above then... [20:51:04] YuviPanda: I’m not seeing much savings with a lot of these. I’ll finish up the ones that you’ve stopped, but going forward let’s skip anything that’s small or medium. [20:51:09] valhallasw`cloud: no, puppet sets it. hence you need to two puppet-runs [20:51:47] andrewbogott: that might be specifically the case with tools-static because it does end up using a lot of space for the cdn. I think the others should compress better... [20:51:58] tools-webproxy* hardly uses any space and neither does tools-services* [20:52:02] ah, ok... [20:52:04] gist updated: https://gist.github.com/halfak/c5ffed354c67552aebe7 [20:52:13] Can anyone help me figure out what's up? [20:53:14] !log tools stopping, shrinking, restarting tools-shadow [20:53:20] Logged the message, dummy [20:53:48] halfak: give me a minute [20:55:58] andrewbogott, sure. Thanks [20:56:13] * halfak has too many pings happening at the same time. 
[20:57:07] o/ Shilad [20:57:26] andrewbogott said he'd take a look in a minute. [20:57:35] !log deployment-prep KILL KILL KILL DEPLOYMENT-LUCID-SALT WITH FIRE AND BRIMSTONE AND BAD THINGS [20:57:40] Logged the message, Master [20:57:41] Shilad, I've shared this gist with the channel: https://gist.github.com/halfak/c5ffed354c67552aebe7 [21:00:25] * halfak runs off to a meeting. [21:00:40] If I don't respond, Shilad knows what's up (about as much as I do anyway). [21:01:16] !log tools failover tools-static to tools-static-02 [21:01:21] Logged the message, Master [21:04:09] halfak, Shilad, I don’t know what’s up with that instance, you should just delete it and start anew. [21:04:19] OK! That's easy. [21:04:19] If it happens twice then it’s interesting :) [21:04:23] Will do. [21:04:25] Thanks! [21:04:34] andrewbogott, thanks for taking a look. [21:06:00] !log stopping, shrinking and starting tools-exec-catscan [21:06:14] !log tools stopping, shrinking and starting tools-exec-catscan [21:06:18] !log tools failover tools-webproxy to tools-webproxy-01 [21:06:45] stopping, is not a valid project. [21:06:45] Logged the message, dummy [21:06:48] well spotted, labs-morebots [21:06:49] Logged the message, Master [21:07:55] PROBLEM - Host tools-exec-catscan is DOWN: CRITICAL - Host Unreachable (10.68.17.23) [21:10:11] YuviPanda: huh. the console output suggests puppet is done, but apt-get is still working hard. Oh, that's from a cron puppet. [21:10:22] valhallasw`cloud: pretend the console doesn’t exist :) [21:10:25] that’s what I usually do [21:11:12] *nod* [21:11:17] RECOVERY - Host tools-exec-catscan is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [21:11:18] ugh super slow host [21:11:30] !log tools shrinking tools-exec-gift [21:11:34] Logged the message, dummy [21:11:34] YuviPanda: also, shinken needs to be updated with the new hosts, I think? [21:11:43] valhallasw`cloud: it does that automatically on next run. [21:11:46] so give it a few minutes. [21:11:47] YuviPanda: oh, good. [21:12:09] based on what, exactly? if I shut down -mailrelay-01 now, shinken will panic? [21:12:13] PROBLEM - Host tools-exec-gift is DOWN: CRITICAL - Host Unreachable (10.68.16.40) [21:12:28] valhallasw`cloud: based off LDAP. look at ‘shinkengen’ in operations/puppet repo :) [21:12:34] valhallasw`cloud: yeah, it’ll panic [21:12:37] hm, okay [21:12:39] good to know [21:13:35] valhallasw`cloud: do you want me to merge your exim patch while we’re at it? [21:13:43] sure [21:13:55] it's 100% bug free now! [21:14:09] oh [21:14:13] has a -1: https://gerrit.wikimedia.org/r/#/c/205914/ [21:14:14] oh! we should do something with the mails still in the queue before we kill -mail [21:14:33] oh, /that/ exim patch [21:14:37] !log tools shrinking tools-static-01 [21:14:38] valhallasw`cloud: we can if we want to basically merge that, and iterate to fix - we can just disable puppet on tools-mail [21:14:41] Logged the message, dummy [21:15:06] YuviPanda: maybe you should read my comment there and respond :-p [21:15:28] it's more philosophical question than a realy issue (apart from the nitpicks) [21:16:03] I'll fix the nitpicks while you ponder [21:16:52] andrewbogott, I deleted the instance and created a new one (wikibrain0), but I'm still seeing the same behavior (ssh hangs). Do you have any other ideas I could try? 
[21:17:02] RECOVERY - Host tools-exec-gift is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [21:17:23] valhallasw`cloud: doing now [21:18:09] !log tools shrinking tools-webproxy-02 [21:18:14] Logged the message, dummy [21:19:06] YuviPanda: eh, remove quotes? [21:19:22] valhallasw`cloud: yeah, the quotes are because the PHP YAML library we use for wikitech is horrible and puts quotes around for no reason [21:19:26] and the file was originally from there [21:19:32] yaml doesn’t need quotes for :: [21:19:42] YuviPanda: I'm thinking more in terms of 'quote strings do not quote other things' [21:19:52] valhallasw`cloud: yeah, I was talking about quoting the keys. [21:19:53] because now core is a string but true is a bool [21:20:00] ok [21:20:12] I don’t particularly mind not quoting / quoting but quoting keys when needed gets on my nerves. [21:20:21] ok [21:21:32] !log tools backup crontabs onto NFS [21:21:36] Logged the message, Master [21:22:17] sadly there’s nothing we can do to failover tools-redis yet :( [21:22:19] Shilad: try now? [21:22:26] * YuviPanda considers nutcracker for that at some point [21:22:28] not today tho [21:22:40] nutcracker + sentinel [21:22:48] YuviPanda: eh, maildomain is where gridmaster is defined, so that's fine [21:22:54] andrewbogott, Victory! [21:22:58] valhallasw`cloud: ? [21:23:04] YuviPanda: https://gerrit.wikimedia.org/r/#/c/205914/7/manifests/role/labstools.pp [21:23:09] What magic did you perform? [21:23:28] Shilad: the last instance wasn’t coming up properly. This one did come up, but the firewall rules for your project didn’t allow ssh. (They should allow it by default, but something went wrong during project creation.) [21:23:30] valhallasw`cloud: why not just let that be in toollabs class too [21:23:36] So keep in mind that now /everything/ is firewalled off except for ssh. [21:23:48] If you want to add web access you’ll need to edit your default security group. [21:24:05] andrewbogott: Got it. Thanks so much for your help. [21:24:12] YuviPanda: because puppet [21:24:18] Shilad: you’re welcome [21:24:26] valhallasw`cloud: haha :) [21:24:47] YuviPanda: I don't think $proxies et al are available in the subclasses, but I'm not 100% sure [21:24:52] valhallasw`cloud: they are. [21:25:01] valhallasw`cloud: active-proxies is used in exec nodes and also in the webproxies class [21:25:03] YuviPanda: can you acccess tools-webproxy-02? Could you before? [21:25:09] hm, okay [21:25:38] andrewbogott: trying now. I definitely could before. [21:25:43] (I’ve an open shell on it from a few mins ago) [21:25:57] andrewbogott: and can access it now too [21:26:01] YuviPanda: I think the answer is 'because it was in labstools.pp before' [21:26:03] and is working as well [21:26:08] YuviPanda: cool, I’ll declare it fixed then [21:26:10] valhallasw`cloud: :D Move it? [21:26:16] YuviPanda: we can also ask 'why did you put the proxy config in class toollabs' :P [21:26:18] andrewbogott: were there any possible issues? [21:26:26] YuviPanda: but the answer to that is 'hiera', I think [21:26:33] valhallasw`cloud: because it’s needed in a lto of places (services-*, webgrid-*, webproxy-*) [21:26:42] I couldn’t ssh [21:26:42] *lot [21:26:51] andrewbogott: oh, hmm. I definitely can. 
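The quoting discussion around [21:19:06]–[21:20:12] is easy to demonstrate with PyYAML: keys containing `::` parse fine without quotes, while quoting a scalar value changes its type. This is only an illustration, not the actual wikitech-generated YAML.

```python
# Illustration of the YAML quoting discussion: quote strings, don't quote
# other things, and don't bother quoting keys just because they contain "::".
import yaml

print(yaml.safe_load("role::labs::tools::mailrelay: true"))
# -> {'role::labs::tools::mailrelay': True}    bare key is fine, value is a bool

print(yaml.safe_load('"role::labs::tools::mailrelay": "true"'))
# -> {'role::labs::tools::mailrelay': 'true'}  quoting turns the value into a string
```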
[21:26:58] !log tools shrinking tools-submit | [21:27:02] Logged the message, dummy [21:28:17] YuviPanda: dunno, it's available to templates/mail-relay.erb now as well [21:28:26] !log tools attempting to failover gridengine to tools-shadow [21:28:31] Logged the message, Master [21:28:36] valhallasw`cloud: true but now we’ve ugliness in two places instead of one :D [21:28:43] PROBLEM - Host tools-submit is DOWN: CRITICAL - Host Unreachable (10.68.17.1) [21:28:45] valhallasw`cloud: I hope to revisit that in about… a month. [21:29:10] YuviPanda: whatever. I don't understand puppet enough to understand how this works to begin with, so I'm not touching it with a ten foot pole [21:29:26] valhallasw`cloud: yeah, you can let that be as is if you’d like and I can do a follow up patch :) [21:29:41] valhallasw`cloud: so the question again is - do you want mailrelay to start with ‘new world order’ or do you want to just have it use the current roles? [21:29:52] valhallasw`cloud: I’m ok merging that patch as is, btw. [21:30:07] YuviPanda: 'new world order' = hiera ALL the things? [21:30:19] valhallasw`cloud: new world order: your patch and whatever comes after [21:30:33] oh. I don't care [21:30:35] rather than ‘we switch between exim-light and exim-heavy every other puppet run' [21:30:49] valhallasw`cloud: ah, in that case let’s hold off on your patch for now and do that in later? [21:30:50] well, exim-light and exim-heavy don't exist anymore in 14.04 to begin with [21:30:58] so everything will be broken(TM) anyway [21:31:35] oh [21:31:37] grrr [21:31:55] grrlol [21:32:08] and I'm not going to fix that tonight, because puppet is still buildinggggg [21:32:18] valhallasw`cloud: building where? on wikitech? [21:32:20] ignore that as well [21:32:24] !log tools shrinking tools-redis [21:32:27] no, in /var/log/puppet.log [21:32:28] Logged the message, dummy [21:32:29] basically ignore anything wikitech tells you... [21:32:30] oh [21:32:32] that’s strange. [21:32:35] why? [21:32:44] it has a gazillion gazillion packages to install [21:32:46] no idea. I do that and it works out well for me :D [21:32:48] and only a single cpu :p [21:32:50] ah :) [21:33:00] load average: 1.16, 1.24, 1.15 lalala [21:33:06] hmm, why is the gridengine failover time 10 minutes [21:33:19] Coren: ^ we should reduce that at some point... [21:33:29] I guess we can just move that this time. [21:33:51] !log tools failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work [21:33:54] IIRC 10 minutes is stock config. [21:33:55] Logged the message, Master [21:34:09] You don't want it too low, because flapping of masters is quite detrimental. [21:34:17] Coren: true, but lesser? 3 minutes? [21:34:20] PROBLEM - Host tools-redis is DOWN: CRITICAL - Host Unreachable (10.68.16.18) [21:34:35] YuviPanda: while you're merging stuff, https://gerrit.wikimedia.org/r/#/c/207043/ and https://gerrit.wikimedia.org/r/#/c/204100/ could use a merge [21:34:40] 3 sounds reasonable. [21:34:45] Coren: if I do a clean shutdown on master and do a clean start on shadow will that be ok? 
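On the "nutcracker + sentinel" idea for tools-redis failover ([21:22:17]–[21:22:40]), the client side of a Sentinel setup could look roughly like this with redis-py. The Sentinel hosts and the service name are invented for illustration; no such setup existed at the time of the log.

```python
# Sketch of client-side Redis failover via Sentinel, as floated for tools-redis.
# Sentinel node names, port and the "tools-redis" service name are assumptions.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("tools-redis-sentinel-01", 26379),   # hypothetical sentinel nodes
     ("tools-redis-sentinel-02", 26379)],
    socket_timeout=0.5,
)

# Writes go to whichever node Sentinel currently reports as master; if that
# node dies, Sentinel promotes a replica and this handle follows it.
master = sentinel.master_for("tools-redis", socket_timeout=0.5)
replica = sentinel.slave_for("tools-redis", socket_timeout=0.5)

master.set("healthcheck", "ok")
print(replica.get("healthcheck"))
```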
[21:34:48] YuviPanda: and all of [21:34:49] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+AND+(owner:%22merlijn+van+deen+%253Cvalhallasw%2540arctus.nl%253E%22+OR+owner:%22coren+%253Cmpelletier%2540wikimedia.org%253E%22+OR+owner:%22Yuvipanda+%253Cyuvipanda%2540gmail.com%253E%22+OR+owner:%22Tim+Landscheidt+%253Ctim%2540tim-landscheidt.de%253E%22+)+AND+status:open+AND+label:code-review%253 [21:34:49] D0,n,z could use a review ;-) [21:34:54] ugh gerrit urls [21:35:03] YuviPanda: You mean to switch which is current master? [21:35:09] Coren: yeqah [21:35:14] long gerrit url = https://tinyurl.com/ocfs55l [21:35:38] Yeah. service gridenine-master start iirc [21:36:21] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [21:36:36] Coren: trying that now [21:37:11] Hm. Failed to start? [21:37:22] Coren: yup [21:37:27] * Coren checks. [21:37:27] Coren: on tools-shadow [21:38:15] I think the interlock is screwing with us. [21:39:06] sigh [21:39:20] I noticed before that the shadow doesn't like to start unless the master actually *crashed* [21:40:17] yeah, but I guess it shouuld start if explicitly started [21:40:49] Coren: I just started it back on master [21:41:00] RECOVERY - Host tools-redis is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [21:41:07] Hang on, I wrote it down somewhere on wiki. [21:41:29] !log tools shrinking tools-master [21:41:37] Logged the message, dummy [21:41:40] Coren: too late, it’s being shrinked :D [21:41:45] Coren: we should try this again after [21:42:06] valhallasw`cloud: merged one [21:43:06] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [21:43:41] Master is back [21:44:21] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [21:46:12] w00t [21:46:33] andrewbogott: yup, is working fine [21:48:04] YuviPanda: other shrinkage coming up, or are we done for a while? [21:48:13] Well, i guess you’re not done since you’re refactoring the exec and web nodes :) [21:48:19] andrewbogott: I think we’re done for a while [21:48:26] andrewbogott: for shrinking at least yeah [21:48:33] great [21:48:41] andrewbogott: yeah, I’m going to be doing testing-y things on toolsbeta now and then recreate them all [21:51:48] have sweet dreams labs! [21:54:59] dang, the ganglia disk_total reports are weird. Everything is green and blue except for labvirt1005 which is yellow because it has /more/ free space. [21:55:07] “Warning! Your disk is dangerously underfilled” [21:55:27] <^d> "Did you delete something by accident?" [21:57:59] YuviPanda: hurray! now we just need to monitor tools.tools-mail.exim.paniclog.length [21:58:19] and tools.tools-mail.exim.queue.num_frozen and tools.tools-mail.exim.queue.oldest, maybe [21:58:44] valhallasw`cloud: :D we can specify a check for all hosts with the appropriate puppet class / role :) [22:01:53] YuviPanda: please go ahead, I'm not also going to learn shinken [22:02:19] valhallasw`cloud: yeah can you give me some numbers on which to alert for? or we can just look at them for a week and decide on numbers [22:03:31] YuviPanda: paniclog >= 1, num_frozen warn > 15 (or so? not sure), critical > 50 (it's at 53 right now, and we need to fix that because that means mails aren't going out to the right people) [22:03:42] right [22:03:55] valhallasw`cloud: I’m going to do the /tmp and swap patches now and I’ll do this after? 
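The exim figures named at [21:57:59]–[21:58:19] (paniclog length, frozen-message count, age of the oldest queue entry) could be gathered by a small cron script along these lines and pushed to Graphite's plaintext port. The Graphite host, the metric prefix and the paniclog path are assumptions, and the `exim -bp` parsing is the simplified textbook format; treat it as a sketch, not the script actually running on tools-mail.

```python
# Sketch: collect exim queue/paniclog numbers and ship them to Graphite.
# Graphite host and paniclog path are assumptions; on Debian the binary may
# be "exim4" rather than "exim".
import re
import socket
import subprocess
import time

GRAPHITE = ("graphite.example.wmflabs", 2003)   # placeholder host, plaintext port
PREFIX = "tools.tools-mail.exim"
PANICLOG = "/var/log/exim4/paniclog"

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def age_to_seconds(age):
    """Convert an exim queue age such as '47h' or '3d' into seconds."""
    return int(age[:-1]) * UNITS[age[-1]]

# `exim -bp` prints one header line per queued message:
#   <age> <size> <message-id> <sender>   [*** frozen ***]
queue = subprocess.run(["exim", "-bp"], capture_output=True, text=True).stdout
headers = [l for l in queue.splitlines() if re.match(r"\s*\d+[smhdw]\s", l)]

frozen = sum("*** frozen ***" in l for l in headers)
oldest = max((age_to_seconds(l.split()[0]) for l in headers), default=0)
try:
    with open(PANICLOG) as f:
        paniclog_len = sum(1 for _ in f)
except FileNotFoundError:
    paniclog_len = 0

now = int(time.time())
payload = "".join(
    "%s.%s %d %d\n" % (PREFIX, name, value, now)
    for name, value in [
        ("queue.num_frozen", frozen),
        ("queue.oldest", oldest),
        ("paniclog.length", paniclog_len),
    ]
)
with socket.create_connection(GRAPHITE) as sock:
    sock.sendall(payload.encode())
```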
[22:04:02] let me copy that to the ticket [22:04:33] oldest not completely sure, probably warn after 48 hours = 172800 and critical after a week = 604800 [22:04:53] although I guess it might purge frozen mails after a while, as the current oldest is slightly more than 48 hours [22:05:05] and it wants to deliver some of them to @toolserver.org, so that's not going to work [22:05:23] we supposedly have redirects in place for that [22:05:32] not for mail [22:05:36] just for web afaik [22:05:45] valhallasw`cloud: for mail too, although that’s unpuppetized / not in a repo [22:05:49] oh. [22:05:49] valhallasw`cloud: Coren set that up sometime ago [22:06:01] well, maybe it's not working correctly then ;-D [22:06:06] not surprised :) [22:06:17] anyway, needs human intervention, so we should get warnings [22:06:21] yeah [22:06:42] YuviPanda: for localcrontab, as far as I'm concerned we can set critical to > 0, but that would mean tools-trusty immediately goes to critical [22:06:55] so maybe just warn > 0 and no critical? [22:07:08] it's not important, juts needs intervention [22:07:16] valhallasw`cloud: I was going to delete tools-trusty…. [22:07:16] :D [22:07:30] -_-' [22:07:40] mail to @toolserver.org work, for the most part. [22:07:42] clear up the crontabs first =p [22:07:52] but many of them are no longer valid as they were not updated in ages. [22:13:24] andrewbogott: re: restarts, can you do quarry and wdq projects when I’m around? [22:13:49] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:14:36] :{ [22:14:50] YuviPanda: yes, I can do them now if you like. [22:15:03] andrewbogott: tomorrow? too much excitement for the day already :) [22:15:19] YuviPanda: ok… [22:15:28] Coren: what’s the swap / overcommit ratio for webgrid nodes? [22:15:39] YuviPanda: respond to my email so that I have a record? [22:15:43] andrewbogott: i want to start creating the new exec nodes tonight. they take like an hour or two for puppet to run [22:15:44] andrewbogott: sure [22:16:06] RECOVERY - Host tools-login is UP: PING OK - Packet loss = 0%, RTA = 9.92 ms [22:16:20] andrewbogott: done [22:16:26] andrewbogott: ^ hmm, ghost ldap entries still persist. [22:16:38] YuviPanda: I didn't put h_vmem on the webgrid nodes under the presumption of shared executables (which held true for lighttpd/php) and limited only by jobs there. [22:16:40] yeah, it’s unkillable [22:17:33] Coren: actually, this is something me and scfc were thinkign about earlier, and I’m wondering if I should just do that now... [22:17:53] Coren: which is to not have separate webgrid nodes for lighty / others at all. Just put all nodes in one way, make them uniform, and just let them be... [22:18:02] YuviPanda: I purged from ldap, maybe that will… do it? [22:18:14] Coren: I don’t think we’re constrained by hardware anymore, so we can always just add more nodes if needed. [22:18:29] andrewbogott: yeah, should have. [22:19:07] YuviPanda: That's going to cause some serious underuse - web servers tend to be *big* and have a lot of shared pages - the h_vmem needed to have them run will mean a couple dozen at most on a node whereas with slot limits, you can fid 50-100 lighttpd on a node. [22:19:36] Coren: well, the hardware’s all virtual - anything that doesn’t get used doesn’t get used... 
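The thresholds agreed on around [22:03:31]–[22:04:33] translate directly into a Nagios/Shinken-style check that reads the same metrics back from Graphite's render API. A rough sketch, with the Graphite URL as a placeholder and the warn/critical boundaries taken from the discussion rather than any deployed configuration:

```python
#!/usr/bin/env python3
# Sketch of a Nagios/Shinken-style check for the exim metrics, using the
# thresholds discussed in the log. Exit codes: 0=OK, 1=WARNING, 2=CRITICAL.
import sys

import requests

GRAPHITE = "http://graphite.example.wmflabs"    # placeholder URL
METRIC_PREFIX = "tools.tools-mail.exim"

# metric suffix -> (warning threshold, critical threshold)
THRESHOLDS = {
    "paniclog.length": (1, 1),                  # anything in the paniclog is bad
    "queue.num_frozen": (15, 50),
    "queue.oldest": (48 * 3600, 7 * 86400),     # 48 hours / one week, in seconds
}

def latest(metric):
    """Return the most recent non-null datapoint for a metric, or 0."""
    resp = requests.get(
        GRAPHITE + "/render",
        params={"target": metric, "from": "-15min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    series = resp.json()
    if not series:
        return 0
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    return points[-1] if points else 0

status = 0
for suffix, (warn, crit) in THRESHOLDS.items():
    value = latest("%s.%s" % (METRIC_PREFIX, suffix))
    if value >= crit:
        status = max(status, 2)
    elif value >= warn:
        status = max(status, 1)
    print("%s=%s (warn>=%s crit>=%s)" % (suffix, value, warn, crit))

sys.exit(status)
```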
[22:20:02] except for disk space, today [22:20:05] :D [22:20:08] yeah, but that’s an exception [22:20:19] That's true to a point, but where 5 webgrid nodes were needed you need to count at least 15 generic nodes. [22:20:47] we can just give them biggest spaces. tune VMEM appropriately [22:21:19] That's my point - for that use patten vmem is a horrible metric. [22:21:34] well, for *every* use pattern vmem is a horrible metric but let’s not go there today :) [22:21:53] I'm all for simplification when it makes sense, but I don't believe this is one of those cases. [22:22:18] how about we just do it and then move back to a separate queue for lighttpd if it, uh, doesn’t fit? [22:22:38] ideally in the long run I only want one type of exec node and nothing separate for web. [22:22:48] scratch that, ideally in the long run I want gridengine to die in a fire but that’s at least a year out [22:24:16] I had forgotten about how long it takes for precise instances to provision first time :| [22:24:28] YuviPanda: You say "ideally" but I don't think I see an ideal to reach for there. That's just change for the sake of change. [22:24:43] I disagree - GridEngine is an unmaintained piece of abandonware. [22:24:57] and has no isolation whatsoever. The entire port isolation problem would go away with containers [22:25:02] " ideally in the long run I only want one type of exec node" [22:25:05] the world has moved on, mostly [22:25:07] oh [22:25:08] :) [22:25:13] yeah, I did say to scratch that :) [22:25:42] boo precise. [22:26:39] Coren: I’m going to do tools-exec-1xx for precise nodes and tools-exec-2xx for trusty nodes. [22:26:46] now how do I even test these... [22:26:56] "test"? [22:27:05] the /tmp creation patch [22:27:09] and then the swap creation patch [22:27:18] YuviPanda: so… things are stable now, albeit stupid, correct? [22:27:23] andrewbogott: for tools? yes. [22:27:30] or in general? [22:27:31] YuviPanda: how about tools-exec-12xx and 14xx? :P [22:27:52] valhallasw`cloud: oooo, that’s nice too. let me do that. [22:27:52] valhallasw`cloud: thanks [22:28:48] Coren: valhallasw`cloud so I’m going to do something really horrible: 1. disable puppet on all exec nodes that we have now, 2. create new instances, test on them, 3. create enough new instances 4. kill old instances [22:28:53] (2) might be an iterative process. [22:28:55] any objections? [22:29:06] why disable puppet? [22:29:18] valhallasw`cloud: so if I fuck up a change it doesn’t fuck up current nodes [22:30:07] Doesn't sound insane. And I'm not worried about the labs_lvm::volume use - that's been well tested. [22:30:20] The only real caveat is if you use 100%FREE somewhere you need to make sure it's last. [22:30:33] YuviPanda: eh, okay. I don't get what needs to be changed, but sure. [22:30:52] valhallasw`cloud: 1. add /tmp as a volume, 2. add 3xMemory swap [22:30:59] Coren: yup. no 100% anywhere tho [22:31:05] YuviPanda: ah, good [22:31:11] I don't remember seeing your swap change though. [22:31:35] Coren: yeah haven’t written it yet >_> [22:31:51] Coren: not sure how exactly to do that either. lvm a swap partition and then swapon exec, I guess. [22:32:55] YuviPanda: Yeah, make-instance-vol doesn't have a mkswap option atm. [22:33:02] yeah [22:33:07] I could also just do two execs [22:33:29] Coren: hmm, I need to mkswp with lvm somewhow, right? since all the space is unallocated but in LVM? [22:33:32] or am I reading that wrong? [22:34:07] mkswap is the mkfs equivalent for swap - you can do that on the volume. 
Although, maybe mkfs -t swap is actually smart enough to do it -- lemme check. [22:34:48] yeah, I’m hoping it can [22:35:35] anyone using multiwiki with a labs instance? i enabled labs-vagrant, provisioned, enabled cirrussearch, provisioned, and now am attempting to run some tests. cirrussearch enables a cirrustestwiki, but it doesn't seem to work. happy to debug but checking if anyone else has already fixed this :) [22:35:54] ebernhardson: werdna was doing multiwiki stuff [22:35:56] YuviPanda: Ah, no it can't but that's easily fixed. [22:36:31] ebernhardson: I have some multiwiki labs-vagrant instances. can you elaborate on "doesn't work"? [22:36:35] YuviPanda: ok thanks, i'll poke him (probably in the morning, its a bit late if hes still in europe) [22:37:00] oh, except he isn't working for us anymore :S oh well i bet hes still randomly around irc [22:37:01] * YuviPanda grumbles about salt again [22:37:32] “sure, run every command about twice and it will be ok" [22:38:11] bd808: connecting from localhost to cirrustestwiki.local.wiki.wmftest.net reports itself as devwiki, rather than cirrustestwiki [22:38:42] bd808: the same vagrant config works just fine locally, but its not picking it up on my labs instance [22:38:56] ah. that's kind of by design. On a labs-vagrant instance the hostname will be cirrustestwiki- [22:39:55] bd808: ahha, i suppose that makes sense although unexpected :) [22:40:51] ebernhardson: It happens because of /vagrant/puppet/hieradata/environment/labs.yaml's mediawiki::multiwiki::base_domain setting [22:40:53] bd808: so thats going to expect to be on wmflabs.org? [22:41:29] ok, yup [22:41:45] multiwiki does this in labs so that you can actually test/use centralauth and the other wikis [22:42:18] otherwise you wouldn't be able to get in from the outside [22:42:31] i was just going to ssh -L 8080:localhost:80 phantomcirrus.eqiad.wmflabs :) [22:42:58] but certainly i can see for most it would be more useful to be public [22:45:54] YuviPanda: The only caveat to using labs_lvm for swap is that you won't get an implicit swapon -- you'll have to add one yourself. [22:46:01] Coren: yeah. [22:47:02] Hm [22:47:21] Actually, you know what? This is crap as-is. Lemme make a labs_lvm::swap resource. [22:47:31] Coren: :) ok [22:48:12] Coren: thank you :) [22:52:16] question: are phabricator accounts and wikitech/labs accounts the same or different? i.e. if somebody has phabricator account, would he be able to use it on wikitech/labs? [22:55:45] SMalyshev: they’re differentish. labs is ‘ldap account’, same as gerrit. phabricator can be either ldap (and same as wikitech/gerrit/labs) or Mediawiki OAuth (and same as enwiki, SUL, etc) [22:56:40] aha, so if it's MediaWiki User in userinfo then he needs another account for ldap [22:58:49] RECOVERY - Puppet failure on tools-mailrelay-01 is OK: OK: Less than 1.00% above the threshold [0.0] [22:59:10] YuviPanda: thanks [23:03:05] (03CR) 10Jforrester: "> This version hasn't been deployed yet, no (mainly because it's the same data, so there's no reason to deploy). What's the issue you're s" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 (owner: 10Awight) [23:06:35] valhallasw`cloud: wikibugs seems awfully quiet, it might have tripped over the tools-redis migration earlier. My redis pubsub feed ended up in strage state [23:07:24] sitic: yeah, I should setup nutcracker... [23:12:35] * YuviPanda smacks puppet over the hear [23:12:38] *head [23:12:40] why are you so slow? 
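Until the labs_lvm::swap resource mentioned at [22:47:21] exists, the manual sequence discussed above ("lvm a swap partition and then swapon", sized at roughly 3x memory) amounts to something like the following, written here as Python subprocess calls. The "vd" volume group name is an assumption and the whole thing needs to run as root; the planned puppet resource would supersede it.

```python
# Hand-rolled version of what a labs_lvm swap resource would do: carve out a
# logical volume, mkswap it, swapon it, and persist it in fstab.
# Run as root. The volume group name "vd" is an assumption.
import subprocess

VG = "vd"
LV = "swap"
DEV = "/dev/%s/%s" % (VG, LV)

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Size the volume at 3x physical memory, per the rule of thumb in the log.
with open("/proc/meminfo") as f:
    mem_kb = int(next(l for l in f if l.startswith("MemTotal")).split()[1])
size_mb = mem_kb * 3 // 1024

run("lvcreate", "-n", LV, "-L", "%dM" % size_mb, VG)
run("mkswap", DEV)
run("swapon", DEV)

# Make it survive a reboot.
with open("/etc/fstab", "a") as fstab:
    fstab.write("%s none swap sw 0 0\n" % DEV)
```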
[23:13:58] * sitic hopes that the guy who build redis continues to build his new distributed message broker https://github.com/antirez/disque [23:21:39] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:21:41] YuviPanda: manifest? [23:22:10] What is it and how will it replace webwatcher? [23:22:16] hI T13|mobile [23:22:39] I’m in the middle of creating new exec nodes. Can I get back to this later? [23:22:51] T13|mobile: until then: please read https://phabricator.wikimedia.org/T94883 and https://phabricator.wikimedia.org/T90561 [23:22:52] thank you [23:23:30] Sounds good. [23:23:35] cool [23:23:45] it’ll be announced in a week or so, but has been runnign and restarting webservcies for the last few weeks [23:28:13] PROBLEM - Puppet failure on tools-webgrid-08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:28:33] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [23:28:41] Except for xTools which I manually restart virtually every day. So, I am all for an improvement there. [23:29:01] yes, that involves xtools’ code being terrible [23:29:11] Very much. [23:29:17] so the plan for that [23:29:25] is to have an endpoint that can be specified in the manifest [23:29:32] and if that’s not returning a 200 response, then restart webservice [23:29:42] which should deal with leagacy codebases like xtools [23:29:49] in the meantime musikanimal turned on webwatcher [23:30:08] I'm basically waiting on xtools.wmflabs.org being set up so I can rewrite xtools san bugs as I migrate. [23:30:16] who are you waiting for? [23:30:24] Nakon [23:30:27] ok [23:30:30] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:31:04] He's the one we delicated to setting the instance up because he wants it to use heavy or something. [23:31:37] %s/delicated/deligated/ [23:31:41] O.o [23:33:32] xtools’ problems have never been lack of resources but the fact that it’s a big codebase worked on by many people over time with lots of hacks. [23:33:35] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:33:38] no amount of resources is going to fix that, but good luck [23:34:47] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:37:48] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:38:54] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:38:58] I'm pretty sure any problem can be solved by throwing more hardware at it [23:39:59] legoktm: not really. imagine: your code deadlocks (not the case here, probably) - no amount of hardware is going to fix that [23:40:14] PROBLEM - Puppet failure on tools-webgrid-generic-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:40:51] YuviPanda: surely you could overpower your db servers so they don't deadlock! 
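The restart-on-bad-healthcheck plan sketched at [23:29:25]–[23:29:42] boils down to a loop like the one below. The manifest path, the "healthcheck" key and the flap-protection counter are invented for illustration; this is not the actual webwatcher or service-manifest code.

```python
# Rough sketch of the watchdog idea: read a health-check URL from the tool's
# service manifest and restart the webservice when it stops returning 200.
# Manifest path and "healthcheck" key are assumptions, not the real format.
import subprocess
import time

import requests
import yaml

MANIFEST = "/data/project/mytool/service.manifest"   # hypothetical tool
CHECK_INTERVAL = 60           # seconds between probes
FAILURES_BEFORE_RESTART = 3   # avoid flapping on one slow response

def healthy(url):
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

with open(MANIFEST) as f:
    manifest = yaml.safe_load(f) or {}
url = manifest.get("healthcheck", "https://tools.wmflabs.org/mytool/")

failures = 0
while True:
    if healthy(url):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_RESTART:
            # run as the tool user, the same way a manual restart would be done
            subprocess.run(["webservice", "restart"])
            failures = 0
    time.sleep(CHECK_INTERVAL)
```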
[23:41:19] * YuviPanda builds bridge over legoktm [23:42:04] * legoktm throws hardware at the bridge and watches it collapse [23:58:31] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:59:43] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0]