[00:08:35] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:15:50] 10Tool-Labs-tools-Other, 7I18n: Find or create a CLDR data parser for Intuition or PyIntuition - https://phabricator.wikimedia.org/T102231#1784306 (10Krinkle) As mentioned at and , plural support alr...
[00:19:16] PROBLEM - Puppet failure on tools-exec-1220 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:22:21] andrewbogott: you built invisible unicorn the last time, right? where were the debian/ files for it? did you check them in anywhere?
[00:23:12] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[00:23:29] there are instructions for building in the readme. I believe that setuptools generates the /deb files
[00:25:44] aaaah
[00:25:46] ok
[00:25:49] andrewbogott: I'm actually going to just move that into puppet
[00:25:51] since it's just one file
[00:26:22] YuviPanda: there are a lot of python tools in gerrit that are built that way
[00:26:31] so probably best to leave it be unless you need to hand-tune the file
[00:26:38] no, I just merged a patch to it
[00:26:48] a patch to a file in deb/
[00:26:49] ?
[00:26:57] no, a patch to the api.py file
[00:27:12] ah, ok
[00:27:13] and my instinct is to just move it to a file since it's simple enough and will probably not become multi-file...
[00:27:21] and it'll be less work than actually building the package etc. as well
[00:27:23] it’s a generated file though
[00:27:36] not the debs
[00:27:36] oh!
[00:27:38] api.py
[00:27:40] you mean instead of building it
[00:27:43] yes, that’s fine :)
[00:27:45] yes
[00:28:04] too many pronouns :p
[00:28:07] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[00:28:11] :D
[00:28:13] yeah
[00:28:15] sorry
[00:43:40] RECOVERY - Puppet failure on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:59:18] RECOVERY - Puppet failure on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:03:12] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:08:06] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:52:01] 6Labs, 6Phabricator, 7Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784550 (10Negative24) @mmodell Not at a computer but isn't that just the tracking tag var?
[02:10:23] 6Labs, 6Phabricator, 7Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784582 (10mmodell) @negative24: Production isn't deployed via puppet anymore. I just need to set up labs instances to clone the deployment repo instead of the individual tags.
[02:13:51] 6Labs, 6Phabricator, 7Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784583 (10Negative24) Ah, ok. (I'm a little bit curious of how the deployments are deployed; are they just pulled via git or something else?)
[08:43:59] 6Labs, 6Phabricator, 7Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784904 (10mmodell) negative24: #scap3
[09:38:09] Change on 12www.mediawiki.org a page OAuth/scn was created, changed by Sarvaturi link https://www.mediawiki.org/wiki/OAuth/scn edit summary: Created page with "Putìssitu èssiri a l'arricerca d'unu dê siguenti:"
[11:07:46] 6Labs, 6Discovery, 10Maps: Update wiki page with OSM Postgres access info - https://phabricator.wikimedia.org/T116355#1785173 (10akosiaris) 5Open>3Resolved a:3akosiaris Page updated with the correct info
[13:21:29] (03CR) 10Jean-Frédéric: [C: 032] Keep alive connection to MySQL database using Ping [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/250202 (https://phabricator.wikimedia.org/T117045) (owner: 10Jean-Frédéric)
[13:21:39] 6Labs, 6operations, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785377 (10Krenair) >>! In T117394#1774385, @Krenair wrote: > IIRC, labswiki jobs are supposed to be running locally on silver only... Actually, we...
[13:22:15] (03Merged) 10jenkins-bot: Keep alive connection to MySQL database using Ping [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/250202 (https://phabricator.wikimedia.org/T117045) (owner: 10Jean-Frédéric)
[13:27:14] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 5Patch-For-Review, and 3 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1785381 (10coren) 5Open>3Resolved The issue is definitely distinct; no test case I've done managed to replicate the original issue (newly replaced...
[13:40:54] 6Labs, 6operations, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785403 (10jcrespo) These were the first occurrences: ``` { "_index": "logstash-2015.10.31", "_type": "mediawiki", "_id": "AVC8f0N1lAIL90ZzMe...
[15:36:38] PROBLEM - Host tools-exec-1221 is DOWN: CRITICAL - Host Unreachable (10.68.16.84)
[15:37:50] I need another public IP for the SignWriting project. Are they still available?
[15:55:21] 6Labs, 10Labs-Infrastructure, 6operations: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785605 (10chasemp) a:5RobH>3Papaul for https://phabricator.wikimedia.org/T117097#1783632 thanks papaul
[16:18:48] I am requesting another public IP for the SignWriting project. https://wikitech.wikimedia.org/wiki/Nova_Resource:Signwriting . The new public IP is for a new server that creates svg images for viewing and regular expressions for searching.
[16:31:52] slevinski: You can probably use Special:NovaProxy for that?
[16:32:41] slevinski: but I may be missing something from your setup. In any case, the best way is to file a task in the #labs project, adding andrewbogott as CC
[16:34:17] Thanks. I haven't created a new server on labs in a few years. I will look into the nova proxy.
[16:37:58] (03CR) 10Hashar: "recheck" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/104917 (owner: 10Hashar)
[16:39:09] some hosts went dead: integration-slave-precise-1014, tools-exec-1221, deployment-db1 (the latter is the beta cluster mysql db)
[16:42:48] 6Labs, 10Tool-Labs, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1785762 (10jayantanth) 3NEW
[16:43:19] valhallasw`cloud: Thanks. It looks like the nova proxy will work.
[16:58:27] 6Labs, 10Labs-Infrastructure, 6operations: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785789 (10Papaul) labtestmetal2001 ge-5/0/8 NIC1 ge-5/0/30 NIC2 labtestvirt2001 ge-5/0/17 NIC1 ge-5/0/ 31...
[17:03:20] 6Labs, 10Tool-Labs, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1785797 (10valhallasw) It's not clear to me what the issue is.
What are you trying to achieve, how are you trying to achieve it, what is the expected result and what is the...
[17:03:43] it seems an instance of ours has disappeared, estest1003.search.eqiad.wmflabs. It's in wikitech, and the console output looks sane, but it's not responding to pings or ssh. I can of course try to reboot it in the wikitech interface, but is there anything i should check first so i can know why it failed and try to not do that again?
[17:03:55] or perhaps it's unrelated to us, and some labvirt instance had issues?
[17:04:04] andrewbogott: ^
[17:04:35] ebernhardson: which labvirt host is this?
[17:04:49] valhallasw`cloud: i'm not sure how to tell which host an instance is running on
[17:05:16] should be on the instance information page, I think....
[17:05:32] valhallasw`cloud: just http://i.imgur.com/uhmS9pD.png
[17:05:34] should be on the instance information page, I think....
[17:05:45] iirc there were some issues with labvirt1004
[17:05:45] but it's on labvirt1002
[17:06:39] https://wikitech.wikimedia.org/wiki/Nova_Resource:Estest1003.search.eqiad.wmflabs
[17:06:52] (sorry, irc was sloooow)
[17:06:54] Can an op look at T117881? deployment-db1 is down
[17:06:58] ahh, i've never clicked through to that page
[17:07:07] Coren: YuviPanda https://phabricator.wikimedia.org/T117881 deployment-db1 is dead for some reason :-(
[17:07:11] maybe we have lost a compute node
[17:07:19] ok well i'll just reboot it then. if it happens again i'll be back :)
[17:07:26] hasharmeeting: Lemme check.
[17:07:44] Coren: I noticed some other instances went down as well, such as integration-slave-precise-1014
[17:07:52] deployment-db1 is also on labvirt1002
[17:07:55] oh, maybe it's not just me?
[17:07:55] and tools-exec-1221
[17:08:09] and so is integration-slave-precise-1014
[17:08:26] and tools-exec-1221...
[17:08:44] assuming the Nova resource pages are up to date, that is
[17:08:52] That does look bad. I'm looking at it now.
[17:10:31] if one can get a list of instances running on it and send it to labs-l or put it on some task, that could be handy
[17:11:45] The server itself is up and at least pretends to be working. Digging.
[17:11:58] * hasharmeeting blames disk
[17:13:08] list of hosts on virt1002: https://wikitech.wikimedia.org/w/index.php?title=Special:Ask&q=%5B%5BInstance+Host%3A%3Alabvirt1002%5D%5D&p=format%3Dbroadtable%2Flink%3Dall%2Fheaders%3Dshow%2Fsearchlabel%3D%E2%80%A6-20further-20results%2Fclass%3Dsortable-20wikitable-20smwtable&limit=500&eq=no
[17:13:23] for what it is worth, in Horizon I can not change project; I am always redirected back to the default project
[17:13:30] might indicate some issue with nova / whatever central api
[17:13:36] valhallasw`cloud: I never trust wikitech for this.
[17:13:43] fair enough
[17:14:01] Not all hosts are down on virt1002
[17:14:04] all those Ci-jessie-wikimedia-*.contintcloud.* got deleted most probably
[17:14:19] * Coren added the list on the task
[17:14:49] 6Labs, 10Tool-Labs, 10Beta-Cluster-Infrastructure, 7Database: deployment-db1 is down - https://phabricator.wikimedia.org/T117881#1785848 (10valhallasw)
[17:15:02] deployment-mathoid went down
[17:16:00] hasharmeeting: I lost my backscroll — was/is there a problem?
[17:16:07] andrewbogott: coren on it
[17:16:23] andrewbogott: A number of instances are exploding, all on virt1002 atm, but definitely not all of them.
[17:16:24] andrewbogott: bunch of instances went down for the last half hour or so
[17:16:28] 6Labs, 10Tool-Labs, 10Beta-Cluster-Infrastructure, 7Database: Several hosts on virt1002 are down - https://phabricator.wikimedia.org/T117881#1785861 (10valhallasw)
[17:16:53] Coren: the drive is full
[17:17:00] ah
[17:17:04] andrewbogott: And that would cause that issue?
[17:17:14] could it be caused by nodepool instances (contintcloud project)
[17:17:17] apparently the scheduler is still broken in that same damn way :(
[17:17:19] Coren: it causes all kinds of disasters.
[17:17:38] yeah, but the damn scheduler is supposed to prevent this...
[17:17:47] andrewbogott: I thought that was gone in K?
[17:17:49] anyway, hashar, can you delete those instances for the moment please?
[17:17:54] i.e. the scheduler is supposed to check the compute node disk space?
[17:18:11] 0d 21h 27m 53s 3/3 DISK CRITICAL - free space: /var/lib/nova/instances 0 MB (0% inode=78%): :(
[17:18:27] andrewbogott: There are a few tools instances that have already been brought down; I can migrate them to 1005 to ease the pressure.
[17:18:44] Why didn't that page?
[17:19:07] Coren: it is.
[17:19:54] oh wait, no...
[17:19:55] hm
[17:20:09] hasharmeeting: I don’t understand the context for your last paste
[17:20:16] oh sorry
[17:20:24] that is an icinga alarm for labvirt1002
[17:20:33] from the web interface
[17:20:47] ah, yes, so it is
[17:20:48] andrewbogott: How do you feel about 1005? Good enough that I can migrate a couple of instances to it?
[17:20:58] hasharmeeting: so can you delete instances?
[17:21:06] Coren: yes, go ahead. Migration takes ages though
[17:21:07] 6Labs, 10Tool-Labs, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1785881 (10jayantanth) I mean to say that the full PDF book (with all sub pages, as on ENWS) has not been exported; the tool exports one page (NS:0) only. I want a full PDF of a B...
[17:21:12] openstack server list --long doesn't give me the labvirt host
[17:21:23] so I have no clue which ones are on labvirt1002 :-/
[17:21:37] nova list --all-tenants --host labvirt1002
[17:24:11] andrewbogott: one of the exec nodes is on its way out, that won't harm.
[17:24:21] ok
[17:25:43] hasharmeeting: I don’t think it was nodepool. Possibly some instance suddenly grew its disk usage by 100G
[17:26:00] feel free to delete nodepool instances
[17:26:12] what project would they be in?
[17:26:17] nova list --all-tenants 1 --host labvirt1002 got me a list of all the instances
[17:26:20] contintcloud
[17:26:39] yeah, so none of your instances are on 1002, right?
[17:26:43] 10Tool-Labs-tools-Other, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1785905 (10valhallasw)
[17:26:51] That makes me think the scheduler was working right (not dropping any new instances on that box)
[17:26:51] I mean
[17:26:54] And something else happened.
[17:26:56] it listed all my contintcloud instances
[17:27:00] seems bugged
[17:27:18] oh yeah, that command is broken when scoped with a project.
[17:27:20] grrrr
[17:27:25] everything is terrible!
[17:28:12] ah https://phabricator.wikimedia.org/P2279
[17:28:17] nova list --host labvirt1002 --fields name,status,hostId
[17:28:28] assuming hostId is the id of the compute node
[17:28:37] seems to suggest they run on different labvirt nodes
[17:30:10] and `openstack server list --long` doesn't yield the hostname :-/
[17:31:08] Coren: did you delete an instance? that drive is still at 100%
[17:31:24] No, but there's a migration in progress.
[17:32:03] oh
[17:32:04] andrewbogott: In a pinch, we could sacrifice labs-vmbuilder-lucid
[17:32:11] we are in a pinch
[17:32:24] although, dammit, I moved that already, I thought
[17:33:00] There are a few other clearly test-ish instances.
[17:33:15] ebernhardson: How attached are you to ee-jenkins-test and estest1003?
[17:33:38] ebernhardson: you only just created estest1003, right?
[17:33:45] Coren: estest1003 would kinda be a pain to lose, because it's part of an es cluster
[17:33:45] like an hour ago?
[17:34:18] ee-jenkins-test doesn't matter
[17:34:40] ebernhardson: how can estest1003 be important when it’s only a few hours old?
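The channel above runs into exactly this: `openstack server list --long` doesn't show the hypervisor, and `nova list --fields name,status,hostId` returns an opaque hash rather than a labvirt name. A minimal sketch of one way to answer "which labvirt is this instance on" for a single instance; it assumes admin credentials and the standard `OS-EXT-SRV-ATTR:hypervisor_hostname` row in `nova show`'s table output, neither of which is confirmed in the log:

```shell
# hypervisor_of: pull the hypervisor hostname out of `nova show` table
# output read from stdin. Assumes the admin-only row looks like:
#   | OS-EXT-SRV-ATTR:hypervisor_hostname | labvirt1002 |
hypervisor_of() {
    awk -F'|' '/OS-EXT-SRV-ATTR:hypervisor_hostname/ {
        gsub(/ /, "", $3); print $3
    }'
}

# Hypothetical usage, with the instance name from this incident:
#   nova show estest1003 | hypervisor_of
```

Filtering on one instance sidesteps the project-scoping bug hashar hits with `nova list --all-tenants --host`.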
[17:35:13] !log editor-engagement deleting ee-jenkins-test
[17:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Editor-engagement/SAL, dummy
[17:35:31] andrewbogott: the thing is, when i add it to the es cluster, the cluster then balances the data around
[17:35:48] ebernhardson: ok. I would very much like to delete it. Can you depool in anticipation of that?
[17:35:51] andrewbogott: it's not the end of the world if it's gone, but it's
[17:36:03] it's just annoying, but i'll live
[17:36:06] I’m sorry, the scheduler should absolutely not have put it on that box. We’ll have words after the crisis is over.
[17:36:08] RECOVERY - Host tools-exec-1221 is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[17:36:18] ebernhardson: so can I just delete, or do you need to do something first?
[17:36:33] (if it’s any consolation — the box will crash anyway, if I don’t delete it :) )
[17:36:37] andrewbogott: tools-exec-1221 is now on 1005
[17:36:47] yeah, one instance up again :)
[17:37:00] woo, a glorious 1% of free disk space
[17:37:17] andrewbogott: can i get anything off the box? assuming it's stuffed
[17:37:24] Sorry the tools instances are not more wasteful. :-)
[17:37:44] ebernhardson: it’s probably up now, and will survive for a few minutes. So if you need to evacuate, do it now :)
[17:37:44] shrug, just delete it, i'll reimport everything to the cluster later
[17:37:49] ok
[17:37:53] thank you, sorry for the screwup
[17:38:31] * ebernhardson wonders if he killed this virt by booting up an instance then shuffling another 80G onto it last night...
[17:39:16] ebernhardson: that’s approximately what happened, but that virt is already running over 90%, so it shouldn’t have allowed any new instances in the first place.
[17:40:09] so, ok… Coren, hasharmeeting, Luke081515, is everything working ok now?
[17:40:46] * Luke081515 looks
[17:41:04] no, sorry.
wait a moment
[17:41:44] andrewbogott: deployment-db1 looks down, I can't log in to beta
[17:41:57] (Cannot access the database: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193))
[17:45:21] andrewbogott: I found a couple more to blow away.
[17:45:41] Coren: thanks. Can you investigate the ‘why didn’t this page’ question while I revive instances
[17:45:42] ?
[17:45:58] Sure.
[17:46:05] Once I've made some free space.
[17:47:52] Luke081515: any better now?
[17:48:19] andrewbogott: Yeah, great job :) I'm logged in now :)
[17:48:28] andrewbogott: Now at 92%
[17:48:32] cool, ok, I will try to revive these others
[17:48:49] Coren: yep, that’s a lot better. After I get things going I’ll see why the scheduler is still scheduling there :/
[17:49:43] Coren: for your reference… a lot of these instances were switched into ‘paused’ state when the drive filled up. It was kvm that paused them, though, not openstack, so nova doesn’t know that they’re paused.
[17:50:02] So, to route around that, I’m logged into labvirt1002 and using libvirt directly: 'virsh resume xxx'
[17:50:10] aha! Makes it easier to recognize then.
[17:50:27] nova shows these instances in a weird state, ‘ACTIVE’ but not ‘running'
[17:50:34] so it’s easy to see which ones paused.
[17:53:16] Wikimedia Labs | Status: some instances were briefly suspended, but all should be normal now | Channel is logged: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/
[17:53:24] um...
[17:54:06] RECOVERY - Host tools-webgrid-lighttpd-1405 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[17:54:18] um?
[17:56:59] um, I tried to set the topic but just said the topic out loud in the channel instead :)
[17:57:22] ok, now, why did this happen...
[17:58:32] * Coren looks up with fear in his eyes.
[18:00:30] "If the value is set to >1, we recommend keeping track of the free disk space, as the value approaching 0 may result in the incorrect functioning of instances using it at the moment."
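The manual recovery andrewbogott describes at 17:49–17:50 (kvm paused the guests when the disk filled, so each one needs a `virsh resume` directly on the labvirt) can be sketched as a small loop. This is a hedged reconstruction, not the exact commands used in the incident: it assumes libvirt's `virsh list --state-paused --name` is available on the host, and it only prints the commands so they can be reviewed before piping to `sh`:

```shell
# resume_paused: read paused libvirt domain names on stdin (one per line)
# and emit the corresponding `virsh resume` command for each.
# Dry-run by design; pipe the output to sh to actually resume them.
resume_paused() {
    while IFS= read -r dom; do
        # skip blank lines that virsh sometimes appends to its output
        [ -n "$dom" ] && printf 'virsh resume %s\n' "$dom"
    done
}

# Hypothetical usage on the affected labvirt:
#   virsh list --state-paused --name | resume_paused        # review first
#   virsh list --state-paused --name | resume_paused | sh   # then execute
```

Going through libvirt directly matches the log's observation that nova still reports the instances as ACTIVE, so the nova API offers no handle on them.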
[18:00:44] So, I guess this is my mistake — that filter really doesn’t care if the physical drive is filling up.
[18:00:52] Which they solved by recommending that I keep an eye on it.
[18:01:10] So! I will fix that.
[18:03:57] "solved"
[18:07:20] andrewbogott: As for why no page; the disk_space monitor in base::monitoring::host isn't set critical, nor is there currently a way to make it so.
[18:07:34] hm
[18:07:55] Given that this outright breaks the labvirts, I'm thinking this should be added.
[18:08:20] create a critical option (default false) and use hiera to override; that should do it
[18:08:33] * andrewbogott agrees
[18:08:37] chasemp: Yep. That's what I was doing right now. :-)
[18:09:47] andrewbogott: thanks for handling the crisis!
[18:09:49] Coren: ^
[18:10:17] * YuviPanda shall go to the office for the metrics meeting and then report back from there
[18:10:54] YuviPanda: See ya in a bit then
[18:10:56] seriously, was out for food at the time, so thanks :)
[18:11:54] andrewbogott: Off the top of your head, what class can I put the hiera value in that you know will cover the labvirts and only the labvirts?
[18:15:18] andrewbogott: openstack::nova::compute make sense to you?
[18:15:20] 6Labs, 10Tool-Labs, 10Beta-Cluster-Infrastructure, 7Database: Several hosts on virt1002 are down - https://phabricator.wikimedia.org/T117881#1786036 (10dduvall) `deployment-db1` is back up and replication to `deployment-db2` is running.
[18:22:45] sigh... i booted a new estest1003 and it put it back on labvirt2 :S
[18:27:36] just delete and retry till it puts it somewhere with space?
from ganglia it looks like 5 is the only one that's not hurting for space (>1TB available)
[18:30:02] 10Tool-Labs-tools-Other, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1786193 (10Tpt) 5Open>3Invalid a:3Tpt If I have understood the description well, it is not a bug but the normal behavior of the tool. See https://github.com/wsexport...
[18:40:04] Hey andrewbogott, can you help me figure out why ores-compute.revscoring.eqiad.wmflabs is shut off and how to turn it back on?
[18:41:33] halfak: I’m not sure why it’s shut down, but I can start it.
[18:41:41] Thank you :)
[18:42:59] halfak: can you reach it now?
[18:44:07] * halfak tries
[18:44:10] Yup!
[18:44:14] Thanks andrewbogott
[18:44:19] <3
[18:48:09] 6Labs, 6operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786298 (10faidon) I don't remember this IRC discussion. Who was attending it? A little more context please? :) In any case, I disagree with that consensus. I think enabling backports fleet-wid...
[18:49:58] 6Labs, 6operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786306 (10coren) @faidon: That was mostly you and Moritz. Lemme see if I find quotables in my local logs. :-)
[18:55:18] 10Tool-Labs-tools-Other, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1786323 (10jayantanth) Could anyone please try the "Grab a download!" PDF from the ENWS main featured article section, top right?
[18:59:22] 6Labs, 6operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786335 (10MoritzMuehlenhoff) Using packages from backports selectively is fine with me; we already do it e.g. with openjdk-8, which we need for the cassandra cluster. It's a valid part of the Deb...
[19:08:01] Is the Tool Labs grid working?
I have a lot of errors like "Unable to initialize environment because of error: denied: host "tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs" is neither submit nor admin host"
[19:19:35] estest1003 this time was put on labvirt1, which according to ganglia has 270G available. am i going to break things by using it?
[19:20:03] (by using it, i mean i have to start re-importing 350G of elasticsearch indices to the three machines estest1001, 1002 and 1003)
[19:22:07] it looks like estest1001 is on labvirt7, 1002 is on labvirt2 and 1003 is on labvirt1
[19:22:33] 1001 and 1002 are already using the disk space; they will essentially delete ~100G from each and then start from scratch
[19:57:43] hey Tpt
[19:57:45] looking into it now
[19:57:50] he's gone I guess
[19:57:52] I can still fix
[20:01:33] Coren: valhallasw`cloud: did either of you add lighttpd-1414 to be a submit host?
[20:01:40] no
[20:01:53] YuviPanda: I just have; I'm writing a task about it now.
[20:02:00] ok!
[20:02:02] thanks
[20:02:07] do !log in the future :)
[20:02:17] I got poked by tpt about an issue that it had caused.
[20:06:27] 6Labs: Tools: Create a check that all web nodes are set as submit nodes - https://phabricator.wikimedia.org/T117906#1786546 (10coren) 3NEW
[20:07:10] Ah, I hadn't seen his message here - he /msg'ed me about it pretty much simultaneously. :-)
[20:07:29] YuviPanda: was that not on the checklist?
[20:07:30] Coren: ok
[20:07:36] valhallasw`cloud: it was and I was an idiot and missed it
[20:07:44] so that one's all on me
[20:08:14] I don't think it's useful to blame; I'm pretty sure it'd be a good idea to have a check for those unsurfaced assertions about the environment, though.
[20:20:20] Coren: I don't think that's a check we can do anytime soon with our puppet stuff
[20:20:45] YuviPanda: can we query a list of hosts easily?
[20:21:00] as in?
[20:21:05] Coren: andrewbogott: can we push the sync-up tomorrow forward an hour
[20:21:23] chasemp: An hour earlier, you mean? Fine with me.
[20:21:32] hour later :)
[20:21:38] YuviPanda: as in 'list all tools-webgrid-XXXX hosts, check if all hosts are in the submit host list'
[20:21:39] chasemp: I don’t know which way ‘forward’ is, but sure
[20:21:47] chasemp: works for me too.
[20:21:52] ok thanks
[20:22:20] I mean, one could abuse wikitech, buuuut :-p
[20:22:23] valhallasw`cloud: easier still to have a check on the host itself. "Am I on the list?"
[20:22:30] oh, good point.
[20:26:36] the only problem being 'ugh, diamond' :-p
[20:30:22] 6Labs, 10Tool-Labs: Redirect Dispenser's tools - https://phabricator.wikimedia.org/T116757#1786600 (10Dispenser) a:3coren
[20:33:06] (03PS1) 10Jean-Frédéric: Add unit test for CH1903Converter [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/251339
[21:27:53] valhallasw`cloud: wanna go through and finish/test/merge the fastapt stuff?
[22:02:31] 6Labs, 10Tool-Labs, 10Beta-Cluster-Infrastructure, 7Database: Several hosts on virt1002 are down - https://phabricator.wikimedia.org/T117881#1786928 (10yuvipanda) 5Open>3Resolved a:3yuvipanda It just ran out of space again.
[22:02:49] 6Labs, 10Tool-Labs, 10Beta-Cluster-Infrastructure, 7Database: Several hosts on virt1002 are down - https://phabricator.wikimedia.org/T117881#1786931 (10yuvipanda) (some instances were migrated off, it has space now, and there's a critical alert for disk space on these hosts)
[22:20:12] 10Wikibugs, 10Differential, 5Gerrit-Migration: Broadcast Differential activity to IRC - https://phabricator.wikimedia.org/T116330#1787017 (10mmodell) wikibugs code is scary. that is all.
[22:21:05] 10Wikibugs, 10Differential, 5Gerrit-Migration: Broadcast Differential activity to IRC - https://phabricator.wikimedia.org/T116330#1787023 (10demon) >>! In T116330#1747176, @greg wrote: > Also, fwiw, you can still use Gerrit for scap dev, the scap3 team is just dogfooding differential to find the things we do...
[22:27:34] twentyafterfour: haha did you look at wikibugs code?
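Coren's "check on the host itself: am I on the list?" idea from 20:22 could look something like the following. It is only a sketch of one possible check, and it makes two assumptions not verified in the log: that gridengine's `qconf -ss` prints one fully qualified submit host per line, and that `hostname -f` on the node matches that form:

```shell
# check_submit_host: read a submit-host list on stdin (one FQDN per line)
# and succeed iff the given host appears in it. Fixed-string,
# case-insensitive, whole-line match.
check_submit_host() {
    grep -qixF -- "$1"
}

# Hypothetical usage as a local health check on a webgrid node:
#   if ! qconf -ss | check_submit_host "$(hostname -f)"; then
#       echo "CRITICAL: $(hostname -f) is not a submit host" >&2
#       exit 2
#   fi
```

Run from the node itself, a missing entry like the tools-webgrid-lighttpd-1414 case above would surface as a monitoring alert instead of user-visible "neither submit nor admin host" errors.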
[22:27:41] I think it's scary partly because phab's api is so bad
[22:34:09] YuviPanda: uh, not now, bed etc
[22:34:24] would be best to do it some time in your morning
[22:34:39] valhallasw`cloud: kkk, I'll try to block out some time for it tomorrow morning!
[22:36:54] I have a skype meeting from 15-16 my time, so somewhere after that
[22:37:39] but that's like 7 your time, so that's unreasonably early
[22:38:06] and tomorrow evening I have a pubquiz to go to
[22:38:07] argh.
[22:38:58] valhallasw`cloud: heh :D we'll figure something out
[22:39:07] kk
[22:39:58] YuviPanda: yeah, phab's api is pretty lame
[23:07:59] 6Labs: Move estest100{1..3} instances to labvirt05, 10 or 11 - https://phabricator.wikimedia.org/T117927#1787102 (10EBernhardson) 3NEW
[23:14:46] YuviPanda: ^^
[23:14:47] https://phabricator.wikimedia.org/T117927
[23:15:13] andrewbogott: so ^ ebernhardson wants to migrate some instances off hosts with lowish disk space onto 1005...
[23:15:19] andrewbogott: mind if I do it?
[23:16:02] YuviPanda: go ahead. Block migration has been working pretty well for me, most of the time.
[23:16:12] andrewbogott: ok
[23:16:13] But be warned that it is very slow!
[23:16:16] how slow?
[23:16:18] hours?
[23:16:21] maybe hours
[23:16:27] but the actual service interruption is much less
[23:16:34] it's going to be days to reload the data anyway, so a few hours to move is no big deal
[23:16:39] andrewbogott: right. and is that a restart or a suspend?
[23:16:43] since it copies, and then after the copy does a last-minute suspend-and-sync
[23:17:05] should be a suspend, except sometimes it screws up and they wind up in state ‘shutoff’ for mysterious reasons.
[23:17:30] right
[23:17:32] ok
[23:19:01] also, if it makes any difference, we can drop the secondary disks on all those machines; they don't have anything useful right now
[23:19:12] the vd-second--local--disk
[23:20:02] YuviPanda: relatedly… https://review.openstack.org/#/c/242251/
[23:20:23] ebernhardson: I'm not sure if dropping them now would actually shrink them
[23:29:00] 10Tool-Labs-tools-Other, 6Wikisource: PDF/Epub output not done of sub page of Bengali Wikisource - https://phabricator.wikimedia.org/T117879#1787189 (10Billinghurst) @tpt from memory, a long time ago you got me to make some configuration changes to some templates or config at enWS to make this work (better) at enW...
[23:29:18] andrewbogott: should I also do one instance at a time?
[23:29:34] YuviPanda: maybe? It’s pretty erratic
[23:29:53] there’s a throttle on the bandwidth it can use, so it won’t cause too much trouble if you do several at once
[23:30:04] kk
[23:30:10] I'll do one first to try out and see how it goes
[23:30:27] 6Labs, 10Tool-Labs, 3Labs-Sprint-115, 5Patch-For-Review, and 3 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1787191 (10MusikAnimal) @coren I'm planning to push an update to my tools tomorrow, and if you are willing to accompany me that would be fantastic. This p...
[23:31:47] andrewbogott: hmm, that live-migrate command just returned immediately.
[23:31:52] 10Wikibugs, 10Differential, 5Gerrit-Migration: Broadcast Differential activity to IRC - https://phabricator.wikimedia.org/T116330#1787193 (10Legoktm) I started looking at adding differential support to wikibugs; the main blocker is a lack of transaction data, which is currently provided by the `maniphest.gett...
[23:31:55] ah
[23:31:57] ok
[23:31:59] yep, it’s REST :)
[23:31:59] use show
[23:32:16] !log search migrate estest1001 to labvirt1005
[23:32:21] ebernhardson: ^
[23:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Search/SAL, Master
[23:32:32] ebernhardson: I'll migrate the other two simultaneously once this one is done
[23:32:38] andrewbogott: how close are the two new nodes to being pooled in?
[23:33:03] hm...
[23:33:32] pretty close, I guess? virt1005 was annoying me the other day, but it’s been fine since then.
[23:33:32] And there’s no reason to suspect 1010 really.
[23:33:37] But, you know, that’s probably best done on a Monday
[23:34:29] right haha
[23:34:32] ok
[23:34:45] I'm just worried we'll run out of disk and get paged over the weekend
[23:34:49] well, hopefully get paged
[23:35:01] 6Labs: Move estest100{1..3} instances to labvirt05, 10 or 11 - https://phabricator.wikimedia.org/T117927#1787211 (10yuvipanda) a:3yuvipanda