[00:13:50] jynus, if you're still around, see https://phabricator.wikimedia.org/T146718
[00:29:01] !log tools Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
[00:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:38:24] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[02:01:53] !log tools Reenabled puppet on tools-checker-01
[02:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[02:08:26] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:13:21] PROBLEM - Free space - all mounts on tools-bastion-02 is CRITICAL: CRITICAL: tools.tools-bastion-02.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-bastion-02.diskspace.root.byte_percentfree (<40.00%)
[05:52:35] Tool-Labs-tools-Xtools: xtools on Tool Labs: Rep Lag High - https://phabricator.wikimedia.org/T156345#2975323 (scfc) `/data/project/xtools/modules/WebTool.php` (which I //assume// is the code path that is triggered) has: ``` function checkReplag( $dbr ) { global $I18N; $res = $dbr->query("...
[06:52:18] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[07:27:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:45:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:26:26] Labs, Labs-Infrastructure, Tool-Labs, DBA, Wikimedia-Developer-Summit (2017): Labsdbs for WMF tools and contributors: get more data, faster - https://phabricator.wikimedia.org/T149624#2758290 (Legoktm) I have moved the notes to https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/La...
[09:36:03] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[09:37:05] Labs, Operations, netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2975634 (faidon) asw-c2-eqiad was replaced yesterday (Jan 26 17:50 UTC) with one of our spares. Total downtime was approximately 30 minutes mostly due to the recabling effort but...
[10:07:07] (PS1) Addshore: Add MoveToCommons repos to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334518
[10:08:28] (PS1) Addshore: Remove WikibaseLexeme from #wikidata [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334519
[10:09:47] (PS1) Addshore: Phab: Add Two-Column-Edit-Conflict-Merge to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334520
[10:10:34] (PS1) Addshore: Phab: Add InterwikiSorting to wmde tech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334521
[10:12:19] (PS1) Addshore: Phab: Add Addwiki to ##add [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334523
[10:12:54] spm spm spm
[10:19:12] (CR) Legoktm: [C: 2] Phab: Add Addwiki to ##add [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334523 (owner: Addshore)
[10:19:26] (CR) Legoktm: [C: 2] Phab: Add InterwikiSorting to wmde tech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334521 (owner: Addshore)
[10:19:33] (Merged) jenkins-bot: Phab: Add Addwiki to ##add [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334523 (owner: Addshore)
[10:19:40] (CR) Legoktm: [C: 2] Phab: Add Two-Column-Edit-Conflict-Merge to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334520 (owner: Addshore)
[10:19:42] legoktm: :D
[10:19:42] (CR) jenkins-bot: Phab: Add Addwiki to ##add [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334523 (owner: Addshore)
[10:19:55] (CR) Legoktm: [C: 2] Remove WikibaseLexeme from #wikidata [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334519 (owner: Addshore)
[10:20:04] (Merged) jenkins-bot: Phab: Add Two-Column-Edit-Conflict-Merge to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334520 (owner: Addshore)
[10:20:06] (Merged) jenkins-bot: Phab: Add InterwikiSorting to wmde tech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334521 (owner: Addshore)
[10:20:09] (CR) Legoktm: [C: 2] Add MoveToCommons repos to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334518 (owner: Addshore)
[10:20:11] (CR) jenkins-bot: Phab: Add Two-Column-Edit-Conflict-Merge to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334520 (owner: Addshore)
[10:20:13] (CR) jenkins-bot: Phab: Add InterwikiSorting to wmde tech channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334521 (owner: Addshore)
[10:20:16] (Merged) jenkins-bot: Remove WikibaseLexeme from #wikidata [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334519 (owner: Addshore)
[10:20:28] (CR) jenkins-bot: Remove WikibaseLexeme from #wikidata [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334519 (owner: Addshore)
[10:20:33] legoktm: do they deploy themselves? :p
[10:20:45] !log tools.wikibugs Updated channels.yaml to: e7d1fc7cac6a53b7c1e58a9d9cbe231726e75468 Merge "Remove WikibaseLexeme from #wikidata"
[10:20:48] addshore: yes
[10:20:49] :)
[10:20:52] cool!
[10:20:55] can I merge there? :P
[10:22:01] addshore: just gave you +2
[10:22:06] thanks! :D
[10:22:13] * addshore adds the repo to his watchlist
[10:22:23] the two yaml files deploy automatically but everything else will require manual deployment
[10:22:46] legoktm: epic!
[10:29:28] (Merged) jenkins-bot: Add MoveToCommons repos to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334518 (owner: Addshore)
[10:29:36] (CR) jenkins-bot: Add MoveToCommons repos to wmde tech chan [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/334518 (owner: Addshore)
[11:25:03] Labs, MediaWiki-extensions-Page_Forms, wikitech.wikimedia.org: Accesing Special:FormEdit gives a blank empty page - https://phabricator.wikimedia.org/T156406#2976041 (MarcoAurelio) >>! In T156406#2974277, @bd808 wrote: > @MarcoAurelio was there something particular you were trying to work on, or did...
[12:09:51] Labs, Operations, netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2957555 (fgiunchedi) >>! In T155875#2975634, @faidon wrote: > During the whole 30 minute window there was also an increased response time from the MediaWiki API, that cascaded in...
[12:41:08] PROBLEM - Free space - all mounts on tools-exec-1221 is CRITICAL: CRITICAL: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-exec-1221.diskspace.root.byte_percentfree (<55.56%)
[13:00:18] Tool-Labs-tools-Xtools: xtools on Tool Labs: Rep Lag High - https://phabricator.wikimedia.org/T156345#2976217 (jcrespo) The table is documented at: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag
[14:46:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:26:09] RECOVERY - Free space - all mounts on tools-exec-1221 is OK: OK: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found)
[15:35:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:37:01] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:42:45] o/ labs people. Happy Friday. :)
[16:18:43] \o halfak :)
[16:19:53] Labs, Labs-Infrastructure, DBA, Operations, Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2976789 (faidon) Ping! Early February is now a week away.
[16:22:18] Labs, The-Wikipedia-Library: Change URL from twl-test.wmflabs.org to wikipedialibrary.wmflabs.org - https://phabricator.wikimedia.org/T152468#2976795 (Aklapper) @Samwalton9: Hmm, this is already in the #Labs basket... You could try reaching out [[ https://lists.wikimedia.org/mailman/listinfo/labs-l | on...
[16:25:18] Labs, Tool-Labs-tools-Other, Commons: Provide service to filter over categorization from a list of Commons categories - https://phabricator.wikimedia.org/T110833#1587559 (zhuyifei1999) FWIW, recent discussion: https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2017/01#Is_there_any_tool_o...
[16:27:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:45:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:08:03] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:17:15] Labs, Labs-Infrastructure: keystone admin api easily overwhelmed - https://phabricator.wikimedia.org/T156337#2976983 (Andrew) I ran uwsgitop on this for a while, and there is definitely no lack of threads (I saw a failure when only 3 of 10 processes were busy.) So I'm not sure what's happening... going...
[17:24:04] Labs, Operations, Patch-For-Review, Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2977003 (madhuvishy)
[17:27:06] Could someone take a look at https://phabricator.wikimedia.org/T152468? It's just a quick URL change :)
[17:27:24] Labs, Operations, Patch-For-Review, Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2977010 (madhuvishy)
[17:27:28] Labs, Operations, Patch-For-Review, Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2977008 (madhuvishy) Open>Resolved Closing this now. Noting that https://wikitech.wikimedia.org/wiki/Incident_documentation/20170118-Labs happened during...
[17:28:27] Labs, The-Wikipedia-Library: Change URL from twl-test.wmflabs.org to wikipedialibrary.wmflabs.org - https://phabricator.wikimedia.org/T152468#2977013 (chasemp) @andrew when you get a second let's chat about this :)
[17:28:35] Samwalton9: I put in a note for andrewbogott to chat about it, I'm not sure it's as simple as you expect
[17:28:41] but maybe so
[17:30:01] Ah, thanks chasemp. Where might the complications be?
[17:30:45] url ownership is by project and so it's about a new thing and ownership rights and removing the old thing more than naming
[17:31:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:32:01] Samwalton9: this was heavily redone not too long ago and I honestly just don't know all the details beyond those broad strokes
[17:32:25] Ok, thanks for looking into it :)
[17:45:02] Labs, The-Wikipedia-Library: Change URL from twl-test.wmflabs.org to wikipedialibrary.wmflabs.org - https://phabricator.wikimedia.org/T152468#2977118 (Andrew) Open>Resolved a:Andrew This is done. I didn't move anything, but added a new proxy at wikipedialibrary.wmflabs.org pointing to the sa...
[17:54:05] Labs, Operations, kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2977163 (yuvipanda) On further thought, I think I just want to use the aptly that we've setup for tools already. 1. We already use this for other package...
[18:06:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:07:39] Labs, The-Wikipedia-Library: Change URL from twl-test.wmflabs.org to wikipedialibrary.wmflabs.org - https://phabricator.wikimedia.org/T152468#2977184 (Samwalton9) That's awesome - thank you!
[18:46:57] Tool-Labs-tools-Xtools: xtools on Tool Labs: Rep Lag High - https://phabricator.wikimedia.org/T156345#2977349 (Matthewrbowker) p:Triage>Low a:MusikAnimal>Matthewrbowker This is a known issue, xTools predates any sort of replag table. I'll go in and make the change if @MusikAnimal doesn't g...
[19:13:22] PROBLEM - Free space - all mounts on tools-bastion-02 is CRITICAL: CRITICAL: tools.tools-bastion-02.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-bastion-02.diskspace.root.byte_percentfree (<10.00%)
[19:22:04] !log reboot tools-bastion-02 as it is having issues
[19:22:05] Unknown project "reboot"
[19:22:09] !log tools reboot tools-bastion-02 as it is having issues
[19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:28:22] RECOVERY - Free space - all mounts on tools-bastion-02 is OK: OK: tools.tools-bastion-02.diskspace._public_dumps.byte_percentfree (No valid datapoints found)
[20:14:06] Labs, Labs-Infrastructure, Patch-For-Review: keystone admin api easily overwhelmed - https://phabricator.wikimedia.org/T156337#2977603 (Andrew) Yuvi's proposed fixes are: 1. http-socket -> http, 2. get rid of threads=
[20:15:09] Labs, Labs-Infrastructure, Patch-For-Review: keystone admin api easily overwhelmed - https://phabricator.wikimedia.org/T156337#2971320 (yuvipanda) Also to use uwsgi::app and not service::uwsgi, since the latter is meant for use with public facing 'services' (like ores or striker) running behind varni...
[20:17:02] Labs, Labs-Infrastructure, Patch-For-Review: keystone admin api easily overwhelmed - https://phabricator.wikimedia.org/T156337#2977617 (yuvipanda) The logging related lines in the uwsgi config also seem to be noops, with no logs to be found there. Some logs in /var/logs/upstart/uwsgi-keystone-admin.l...
[20:25:04] Labs, Operations, kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (scfc) Is there really much experience with `aptly`? :-) I stumble along with it quite a bit, and – if possible – I would much rather switch the...
[22:11:06] Broadcast message from rush@tools-bastion-02 (/dev/pts/30) at 19:22 ... The system is going down for reboot NOW! <- did I miss a maintenance announcement?
[22:14:10] chasemp: ^^
[22:23:03] multichill: nope, bastion-02 needed a reboot. Check the SAL usually to see if expected but unplanned.
[22:32:43] chasemp: You do realize you don't really sound very user-friendly now, right?
[22:33:28] multichill: there's never been a hard guarantee of uptime for the bastions. that's why we have the job grid and k8s
[22:34:09] we don't yank the servers just for fun, but sometimes reboots are needed
[22:34:25] Sure, just communicate it properly
[22:34:54] what specific improvements do you suggest?
[22:35:06] For one, set a message if you do the shutdown
[22:36:38] And "needed a reboot" sounds like you're doing your Windows Updates, that doesn't explain why you had to do it right away
[22:36:39] :/ looks like `reboot` doesn't even support that
[22:37:08] shutdown does of course
[22:38:39] I appreciate the desire of tools users to know what's going on, but server restarts are completely routine. It's a waste of everyone's time to describe each one in detail.
[22:39:16] if we sent an email for each thing we did that might affect users we'd get yelled at for spamming
[22:39:58] but I would like to find a good balance of how to do this
[22:41:18] It's all about managing expectations and decent communication. I'm used to server admins setting the shutdown quite early so you would get notices and switch servers
[22:42:36] I honestly don't know what the issue was this time, but I would be willing to guarantee that it was an emergent issue that required intervention.
[22:43:12] that being said, bastions are for launching jobs to one of the two available job grid systems
[22:43:52] they aren't a time share system for random interactive jobs that should require advance notice
[22:44:44] long term I'd like a system where each interactive session ends up in a k8s pod of its own
[22:44:58] that should make things both more stable and more isolated
[22:47:26] bd808: The bastion hosts have excellent uptime. Just communicating can be better. No technical changes needed
[22:48:30] Ah, I wasn't intending to be rude multichill, only to answer your question so you would be able to check without needing to ask
[22:49:41] In my mind a SAL entry is communicating it, that may not be a shared expectation
[22:53:17] chasemp: shutdown -r
[22:53:36] Or shutdown -r "This server is sick guys, have to reboot"
[22:53:40] That's the only difference
[22:54:25] I ask nothing more. Good night
[22:55:04] night multichill. we'll try to do better :)