[00:07:52] Tool-Labs-tools-Pageviews: Add % mobile to Topviews - https://phabricator.wikimedia.org/T156081#2963591 (MusikAnimal)
[00:21:38] gry: let me find the Puppet file that provides the default config...
[00:22:12] hi, bd808 , i did not use puppet before, if you could guide me how to set that up it would be great
[00:29:34] gry: finally found it. the config isn't in puppet anymore. It is part of the webservice python application -- https://github.com/wikimedia/operations-software-tools-webservice/blob/fca3728739087b51e074102ae8960ccc6793c37b/toollabs/webservice/services/lighttpdwebservice.py
[00:29:44] * bd808 updates that on wikitech
[00:29:53] bd808: good one, where do i put it ?
[00:30:27] (CR) Greg Grossmeier: [C: -1] "No, please keep those in -devtools. No sense in repeating everything from there into -releng. We (the relevant parties of RelEng) watch bo" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/333539 (owner: Paladox)
[00:31:07] gry: those defaults are magically put in the right place for you when you do `webservice start`. If you want to add custom overrides, take a look at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Configuring_the_web_server
[00:31:35] basically you make a file at $TOOL/.lighttpd.conf that adds to the base configuration
[00:33:14] bd808: 'webservice restart' does not create any lighthttpd.conf ? http://dpaste.com/3SYYY42.txt
[00:33:15] I'm not sure if modstatus will work or not (never tried it on tool labs). I think that you would load the module with 'server.modules += ( "mod_status" )'
[00:34:15] ah ok, i will make ~/.lighttpd.conf and i think it would override the default config
[00:34:37] gry: not in the tool's directory, no. It gets written to /var/run/lighttpd/$toolname on the web exec host
[00:35:04] it writes «As it starts, the web server reads any configuration in $HOME/.lighttpd.conf, and merges it with the default configuration (which is likely to be adequate for most tools).
»
[00:35:11] $HOME is /var/run/lighttpd/$toolname ?
[00:35:14] correct
[00:35:44] no, it would be /data/project/$toolname
[00:36:19] if i want custom settings for the webserver, i put it to ~/.lighttpd.conf not to /var/run/lighttpd/$toolname, right ?
[00:36:41] that file is checked for and if found concatenated to the end of the base config. the result is written out only locally on the exec node that is running the tool's webserver job
[00:37:11] gry: correct. put your extra stuff in ~/.lighttpd.conf after becoming the tool
[00:37:18] ok :-)
[00:38:40] bd808: how do i append to server.modules after it was defined once ? i would like to load a new module
[00:38:57] try 'server.modules += ( "mod_status" )'
[00:40:43] excellent, thank you bd808 :-) i got the status page load now
[00:40:54] awesome
[00:41:36] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2963643 (stwalkerster) >>! In T143349#2962892, @chasemp wrote: >> account-creation-assistance accounts-db2 > Any plans to upgrade this instance? We've already migrated off that instan...
[04:15:10] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2963902 (yuvipanda) I checked with @Magnus - going to shut off wikidata-wdq-mm now, can be deleted in a few days
[04:57:46] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2963953 (TParis) I'm in the middle of a move from Hawaii to Texas, Deltaquad would be the best poc. But if it isn't moved by March 1st then I'll take care of it.
Vr TParis
[05:03:15] Striker, Patch-For-Review: Check for 2FA protection and enforce validation of 2FA tokens - https://phabricator.wikimedia.org/T144712#2963976 (bd808)
[05:03:19] Labs, Striker, Security-Team, Wikimedia-Site-requests, and 3 others: Add user group to wikitech granting the oathauth-api-all right - https://phabricator.wikimedia.org/T153487#2963975 (bd808) Open>Resolved
[05:10:01] Jamesofur: i set up https://redmine.lighttpd.net/projects/1/wiki/Docs_ModStatus and i think i found that https://tools.wmflabs.org/dupdet/compare.php?url1=https://en.wikipedia.org/index.php?title=Cognitive_peer-to-peer_networks&oldid=730517492&url2=http://www.sciencedirect.com/science/article/pii/S1084804516301084&minwords=3&minchars=13&removequotations=&removenumbers= crashes dupdet
[05:10:10] Jamesofur: `curl` on that sciencedirect ip also hangs
[05:11:02] Jamesofur: (might be only tangentially related -- i'll see whether it's enough of an issue to build up 1024+ connections and reach the limit)
[05:27:08] (i added connection timeout limit in a copy, will watch whose of the two ages long connections dies and whose doesn't a few hours later when i'm at the internet again)
[06:03:12] Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, and 4 others: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2964119 (bd808) Created #tool-labs-standards-committee and kicked off discussion of {T156075}
[06:04:01] Labs, Discovery, Wikidata, Wikidata-Query-Service: Sunset of WDQ - https://phabricator.wikimedia.org/T153439#2964122 (Multichill) In T143349#2963021 @yuvipanda shutdown a VM. Looks like http://wdq.wmflabs.org/stats and all queries are giving a time out now. Not sure it's related.
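Editor's note on the ~/.lighttpd.conf exchange earlier in this log: bd808 describes the tool's file as being "concatenated to the end of the base config", with the result written out on the exec node. The Python sketch below only illustrates that append-style merge. The function name and the sample base config are assumptions made for illustration and are not taken from the actual webservice code.

```python
from typing import Optional


def merge_lighttpd_config(base_config: str, user_config: Optional[str]) -> str:
    """Return the base config with the tool's override text appended, if any."""
    if not user_config:
        # No ~/.lighttpd.conf found: the defaults are used unchanged.
        return base_config
    return base_config.rstrip("\n") + "\n" + user_config.rstrip("\n") + "\n"


# Hypothetical base config; the real defaults live in lighttpdwebservice.py.
base = 'server.modules = ( "mod_fastcgi" )\n'
# The override line suggested in the log for loading mod_status.
override = 'server.modules += ( "mod_status" )\n'

merged = merge_lighttpd_config(base, override)
print(merged)
```

Because the override is appended rather than substituted, `+=` is what lets a tool extend `server.modules` after the base config has already defined it.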
[06:24:37] Labs, Discovery, Wikidata, Wikidata-Query-Service: Sunset of WDQ - https://phabricator.wikimedia.org/T153439#2964184 (yuvipanda) No, I shut down the VM that was being used before wdq got its own project, so it should be unrelated.
[06:40:35] Labs, Discovery, Wikidata, Wikidata-Query-Service: Sunset of WDQ - https://phabricator.wikimedia.org/T153439#2964188 (yuvipanda) I started it back up just in case, but see no difference. Also restarted the wdq-mm service
[06:47:07] RECOVERY - Free space - all mounts on tools-exec-1221 is OK: OK: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found)
[07:27:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:54:10] Tool-Labs-standards-committee: Figure out how communications and meetings will work for the Tool Labs standards committee - https://phabricator.wikimedia.org/T156075#2963376 (zhuyifei1999) Yes, IRC channel plus mailing list. I wouldn't like to go PM-ing with 6 people at once ;)
[08:02:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:24:38] Tool-Labs-standards-committee: Figure out how communications and meetings will work for the Tool Labs standards committee - https://phabricator.wikimedia.org/T156075#2963376 (Ladsgroup) I'd like to have a hidden satanic cult in Alps (preferably Switzerland) for the meetings but that's up to other members. Be...
[09:06:37] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2964387 (Marostegui) We granted you all on: `p50380g50491_common` the other databases didn't exist on either labs or tool boxes.
[09:11:58] Labs: Request creation of wikidata-federation labs project - https://phabricator.wikimedia.org/T154659#2964435 (WMDE-leszek)
[09:23:41] Labs, Discovery, Wikidata, Wikidata-Query-Service: Sunset of WDQ - https://phabricator.wikimedia.org/T153439#2964457 (Magnus) If you can see http://wdq.wmflabs.org/ then the WDQ server is up and running. If the stats fail, that's an internal problem. I will look into it later, but won't have the...
[10:54:38] Labs, Labs-Infrastructure, Continuous-Integration-Infrastructure, Patch-For-Review: Migrate integration-publisher service to use a Jessie instance - https://phabricator.wikimedia.org/T156064#2964629 (hashar) Jobs updated INFO:jenkins_jobs.builder:Number of jobs generated: 31 INFO:jenkins_jobs.b...
[11:04:56] Labs, Labs-Infrastructure, Continuous-Integration-Infrastructure, Patch-For-Review: Migrate integration-publisher service to use a Jessie instance - https://phabricator.wikimedia.org/T156064#2964651 (hashar) Open>Resolved a:hashar Validated by triggering the job operations-puppet-doc...
[11:05:00] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2964654 (hashar)
[11:06:18] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2919816 (hashar)
[12:31:24] Labs, DBA, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (Marostegui)
[12:31:48] Labs, DBA, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (Marostegui)
[12:49:35] !log tools purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages
[12:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:51:16] Labs, DBA, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (Marostegui)
[12:51:57] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[12:54:30] ^ is fine
[13:01:56] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:33:31] Labs, DBA, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (Marostegui)
[13:40:22] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[13:50:23] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:52:15] !log tools finished upgrading k8s + using debs
[13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:52:25] !log tools upgrading k8s on worker nodes to use debs +
new k8s version
[13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:10:31] Labs, DBA, Operations, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (Marostegui)
[14:10:43] !log video pip installing pywikibot at 7ac0142 across all encoding hosts T155455
[14:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[14:10:47] T155455: Start using async chunked uploads - https://phabricator.wikimedia.org/T155455
[14:12:25] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2965177 (chasemp) >>! In T143349#2962963, @chasemp wrote: > Greetings @Crazycomputers @DeltaQuad @Tparis from https://wikitech.wikimedia.org/wiki/Nova_Resource:Utrs > > Do you have pl...
[14:13:19] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2965184 (chasemp)
[14:15:16] !log video git pulling v2c code to 31479a4 across all encoding hosts T155455
[14:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[14:19:03] !log video depooling encoding01 & 02 and waiting for tasks to finish before restarting (reload the code) T155455
[14:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[14:19:07] T155455: Start using async chunked uploads - https://phabricator.wikimedia.org/T155455
[14:23:54] yuvipanda: so I took a look at Andrew's most recent reminder email ([Labs-l] [Labs-announce] REMINDER: Ubuntu Precise instances on Labs will be shut down at the end of March) and it occurred to me that the one instance listed for social-tools -- social-tools1 -- is the old, broken and unused instance; http://social-tools.wmflabs.org/ uses the social-tools2 instance AFAIK. can you double-check that
[14:23:56] my assumptions are correct?
and even better, can you then nuke that ancient social-tools1 instance? :P
[14:25:11] ashley: can you respond on https://phabricator.wikimedia.org/T143349 instead?
[14:25:20] we are trying to contain the narrative there for everyone involved
[14:25:51] chasemp: will do, thanks!
[14:26:27] ashley: thank you
[14:28:07] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2921002 (ashley) AFAIK http://social-tools.wmflabs.org/ is using the social-tools2 instance, whereas the instance listed for deletion (social-tools1) is an older, broken, unused one. P...
[14:30:46] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2965221 (chasemp)
[14:37:58] !log tools disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
[14:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:41:00] Labs, Labs-Infrastructure: Deprecate precise instances in Labs by 03/31/2017 - https://phabricator.wikimedia.org/T143349#2965245 (Krenair) >>! In T143349#2965213, @ashley wrote: > AFAIK http://social-tools.wmflabs.org/ is using the social-tools2 instance, whereas the instance listed for deletion (social-...
[14:42:45] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:42:47] !log tools re-enable puppet on tools-proxy-01, test success on proxy-02
[14:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:44:53] RECOVERY - Host tools-webgrid-lighttpd-1201 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms
[14:46:10] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2965255 (Andrew)
[14:46:14] Labs, DBA, Wikidata, Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965251 (Andrew) Open>Resolved a:Andrew It looks to me like there's quite a bit of quota room in wikidata-dev already. I increased the max instance coun...
[14:52:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:52:44] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:53:20] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0]
[14:56:22] Labs, Operations, netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2965288 (Cmjohnson) added a secondary switch, asw2-c2-eqiad.
accessible via scs port 48
[14:56:47] Labs, DBA, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2965291 (jcrespo)
[14:56:49] Labs, Tool-Labs, DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965290 (jcrespo) Open>Resolved
[14:58:16] !log video killing the rest of the tasks on encoding01 because they aren't going to succeed anyways; depooling encoding03; reloading encoding01 T155455
[14:58:18] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[14:58:19] T155455: Start using async chunked uploads - https://phabricator.wikimedia.org/T155455
[15:04:48] chasemp: hello! I got rid of a Precise instance ( integration-publisher ) was quite trivial to do
[15:04:55] sorry I have missed that one
[15:05:02] Labs, Analytics, Pageviews-API, wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821#2965325 (Milimetric) No, we should leave it open and blocked on wikitech being set up properly. We could of course collect pageviews via some othe...
[15:05:03] hashar: no worries and thank you
[15:05:06] the other Precise instances for CI are still for PHP 5.3 :(
[15:05:24] I did try to compile Zend 5.3 against Jessie but without success. My C is next to nill!
[15:06:54] hashar: yeah I was going to ask for a bit more info there so that we are all on teh same page, that will possibly end up being an outlier and keep the "goal" yellow
[15:07:04] which if it has to be then ok but wanted to walk through it and make tickets and such
[15:07:08] but probably later this week
[15:07:35] the end idea is to still be able to run tests for MediaWiki 1.23 which is Zend 5.3.
[15:08:00] maybe we would be fine with a single instance.
Or maybe we can look at running them manually in a Vagrant Precise box instead of via ci
[15:08:09] this way the goal is greenish and Precise gone by end of March :]
[15:08:35] ^ would be great :D
[15:08:36] or in a precise docker / rkt container ;)
[15:09:21] yeah
[15:09:36] though I have doubt we will have anything running on rkt by end of quarter!
[15:12:55] hashar: the advantage of rkt over docker is that you can easily do 'sudo rkt run --image= --volumes= ' and it'll behave exactly like just running
[15:12:59] unlike docker which won't
[15:13:07] yuvipanda: yeah totally agreed
[15:13:21] I think the misconception is that we have experimented with plain docker + Dockerfile
[15:13:36] and really dont care whether it is Docker itself or Rkt running the containers
[15:13:37] well
[15:13:42] actually no we care
[15:13:48] we want to use whatever ops uses :]
[15:14:13] hashar: yah, but this would be completely unrelated to all of that :) it'll just be a small hack that'll work with the existing infrastructure to allow us to get rid of precise
[15:14:45] yup
[15:14:59] maybe that could be a first good poc to move CI to k8s
[15:15:14] then, we have no clue where to host the CI containers yet :/
[15:15:20] no :)
[15:15:28] it's all unrelated, and you'll just be able to replace /usr/bin/php with a different script
[15:15:55] ??????????????????????????????
[15:16:03] hold on
[15:16:15] that's why I was talking about rkt.
[15:16:25] just have to install rkt then it can run the container without the need of some daemon / etcd / k8s etc?
[15:16:42] this is purely a way to run php 5.3 on trusty by just putting it in a container image and running it with rkt
[15:16:42] hashar: exactly
[15:16:43] hashar: rkt has no daemons
[15:17:25] so we would craft a Precise docker image that has everything we need. Make it so the entry point is /usr/bin/php
[15:17:30] then instead of php foo
[15:17:53] run rkt --image=ci-precise tests/phpunit ?
:D
[15:18:28] hashar: exactly
[15:18:42] hashar: and mount whatever you want from the host
[15:18:55] hashar: so if you want /srv/home/something on the host to be same on container, that's trivial
[15:19:26] and rkt runs the container inside a kvm right?
[15:20:07] hashar: not by default no
[15:20:14] hashar: that won't really be useful for us since we're already running inside kvm
[15:20:30] what I am wondering is whether we could rkt the containers on the prod machine contint1001 / contint2001
[15:20:45] which together probably have more power combined than all the CI labs instances
[15:21:25] hashar: > sudo -E rkt run --volume curdir,kind=host,source=$PWD --volume passwd,kind=host,source=/etc/passwd --volume=group,kind=host,source=/etc/group --volume=resolv,kind=host,source=/etc/resolv.conf --insecure-options=image --interactive docker://ubuntu:latest --mount volume=passwd,target=/etc/passwd --mount volume=group,target=/etc/group --mount volume=resolv,target=/etc/resolv.conf --mount volume=curdir,target=$PWD --user=`id -u`
[15:21:26] --group=`id -g` --net=host --inherit-env --hostname=`hostname` --set-env=USER=$USER --exec /bin/bash -- -c 'cat /etc/passwd'
[15:21:47] that will run whatever you pass affter -- as the same user as the one executing command, but inside a docker container
[15:22:05] hashar: I'd rather have this run in a labs instance, since running things in prod will be more work
[15:22:18] hashar: the smallest operation would be to just change the entrypoint php5.3 uses to use a wrapper bash script that runs it under rkt
[15:24:02] okkkkk
[15:24:38] the fun will be in provisioning an image with the proper deps
[15:24:56] then I guess if we do that for Precise, that will serve for other jobs / Jessie
[15:25:09] hashar: should be fairly trivial :)
[15:29:58] yuvipanda: I filled a task yesterday about migrating to bootstrapvz
[15:30:12] I went with a different image building system ( diskimage-builder from openstack )
[15:30:27] hashar: migrating
what?
[15:30:28] hashar: bootstrapvz only supports debian I think
[15:31:19] o
[15:31:28] I was talking about the CI images uploaedd to openstack that Nodepool uses
[15:31:41] the aim being to phase out trusty entirely from CI
[15:31:44] and solely run on Jessie
[15:32:28] !log video git pulling v2c code to b2d6e84 across all encoding hosts T155455
[15:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[15:32:31] T155455: Start using async chunked uploads - https://phabricator.wikimedia.org/T155455
[15:32:58] hashar: no I mean, keep using the jessie / trusty images you are currently using
[15:33:15] hashar: and run php5.3 from inside them
[15:33:16] very minimal change
[15:36:24] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:38:19] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965491 (Tb) Apologies, double-underscores seem to get eaten by Phabricator's markup parser. On s1.labsdb: ``` p50380g50491_common p50380g50491__rlrl_enwiki_p p50380g50491__rlrl_ptwiki_p...
[15:41:20] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:43:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:45:38] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965532 (Marostegui) Thanks for the clarification. I have granted access to those databases. Please check them and let us know if that works!
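Editor's note on the rkt discussion between 15:12 and 15:33: the proposal is to replace /usr/bin/php with a wrapper that runs PHP 5.3 inside a Precise container under rkt. The Python sketch below only assembles such a command line and does not run it. The flags are copied from the command pasted in the log at 15:21 and have not been checked against rkt's documentation; the image name `docker://ci-precise`, the function name, and the sample arguments are hypothetical.

```python
from typing import List


def rkt_php_argv(image: str, php_args: List[str], cwd: str, uid: int, gid: int) -> List[str]:
    """Build an argv that would run /usr/bin/php inside `image` via rkt.

    Flag names mirror the command quoted in the log; treat them as unverified.
    """
    return [
        "sudo", "-E", "rkt", "run",
        # Expose the current directory to the container at the same path.
        "--volume", f"curdir,kind=host,source={cwd}",
        "--insecure-options=image",
        image,
        "--mount", f"volume=curdir,target={cwd}",
        # Run as the calling user rather than root.
        f"--user={uid}",
        f"--group={gid}",
        "--net=host",
        "--inherit-env",
        "--exec", "/usr/bin/php",
        # Everything after "--" is passed to php unchanged.
        "--", *php_args,
    ]


argv = rkt_php_argv(
    image="docker://ci-precise",               # hypothetical image name
    php_args=["tests/phpunit/phpunit.php"],    # hypothetical script path
    cwd="/srv/workspace",                      # hypothetical checkout path
    uid=1000,
    gid=1000,
)
print(" ".join(argv))
```

A wrapper installed in place of /usr/bin/php could hand such a list to `subprocess.run`, which would keep the existing Jessie or Trusty CI images in use as suggested in the log.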
[15:49:28] !log tools clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
[15:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:04:15] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2958263 (jcrespo) @Marostegui and others ops, grants are wildcards, **never** use _ without escaping (\_) on a grant. It is not a big deal here, but it can lead to security problems.
[16:04:25] (PS1) Alexandros Kosiaris: Add passwords::redis::ores_password [labs/private] - https://gerrit.wikimedia.org/r/333924
[16:13:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:16:04] (CR) Alexandros Kosiaris: [V: 2 C: 2] Add passwords::redis::ores_password [labs/private] - https://gerrit.wikimedia.org/r/333924 (owner: Alexandros Kosiaris)
[16:16:31] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965610 (jcrespo) @Tb your grants have been added- you should be able to access old data- however, you should consider those grants temporary, until you rename the databases to start with `...
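Editor's note on jcrespo's 16:04 comment: in a MySQL GRANT, `_` and `%` in the database name act as wildcards, so an unescaped `p50380g50491_common` also matches names where the underscore is any other single character. The Python sketch below shows the escaping he describes. The helper names are made up for illustration, and the privilege list and `'%'` host in the statement are assumptions rather than what the DBAs actually ran.

```python
def escape_grant_db(name: str) -> str:
    """Escape the GRANT wildcard characters `_` and `%` in a database name."""
    return name.replace("_", "\\_").replace("%", "\\%")


def grant_statement(db_name: str, user: str) -> str:
    """Format a GRANT for exactly one database (illustrative only)."""
    # Privileges and host are assumptions; adjust to local policy.
    return f"GRANT ALL PRIVILEGES ON `{escape_grant_db(db_name)}`.* TO '{user}'@'%';"


escaped = escape_grant_db("p50380g50491_common")
print(escaped)  # p50380g50491\_common
print(grant_statement("p50380g50491__rlrl_enwiki_p", "s51111"))
```

The sketch covers only the wildcard escaping for database names; it does not validate or quote the user name.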
[16:17:25] !log video encoding03 free disk space < 10G, renicing PID 9136 (50c81becdbd5a1c5) & 3499 (a32b3fa5dc0e2123) & 3611 (0e73d27e7cb00284) to niceness of 1, so these near-complete tasks can finish early and clear some disk space
[16:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[16:21:07] PROBLEM - SSH on tools-webgrid-lighttpd-1201 is CRITICAL: Connection refused
[16:23:25] PROBLEM - Host tools-webgrid-lighttpd-1201 is DOWN: CRITICAL - Host Unreachable (10.68.18.45)
[16:26:31] RECOVERY - Host tools-webgrid-lighttpd-1201 is UP: PING OK - Packet loss = 0%, RTA = 476.48 ms
[16:28:58] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2965667 (Andrew)
[16:29:00] Labs: Increase resource quota for dwl - https://phabricator.wikimedia.org/T152456#2965665 (Andrew) Open>stalled Sorry for the delay! I've adjusted quotas so that you have headroom to add one bigram instance on top of what you have in the project now. After you're done migrating and deleting old ins...
[16:29:21] Labs: Lower quotas to current usage for dwl (when ready) - https://phabricator.wikimedia.org/T152456#2965668 (Andrew)
[16:31:08] RECOVERY - SSH on tools-webgrid-lighttpd-1201 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[16:34:16] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [43200.0]
[16:38:37] Labs, Tool-Labs, DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965716 (Superyetkin) Could you please share the configuration details of new servers? Most of my tools [[http://tools.wmflabs.org/superyetkin/kategorisizsayfalar.php | like this]] (running on trwiki...
[16:39:19] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [3600.0]
[16:44:23] Labs, Tool-Labs, DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965719 (jcrespo) @Superyetkin I cannot guarantee it will not change in the future, but you can connect, in the case of **enwiki** to the `labsdb-web.eqiad.wmnet` host for short-lived, web-like reque...
[16:52:41] Labs, DBA, Wikidata, Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965728 (Ladsgroup) We cleaned some instances and it's okay now. Probably we will make more soon.
[16:58:59] PROBLEM - Host tools-mail-01 is DOWN: CRITICAL - Host Unreachable (10.68.22.188)
[17:02:19] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Joal was modified, changed by Joal link https://wikitech.wikimedia.org/w/index.php?diff=1383744 edit summary:
[17:08:09] PROBLEM - Free space - all mounts on tools-exec-1221 is CRITICAL: CRITICAL: tools.tools-exec-1221.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-exec-1221.diskspace.root.byte_percentfree (<44.44%)
[17:15:35] !log tools stopping tools-mail, backing up, upgrading from precise to trusty
[17:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:17:05] !log tools backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
[17:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:18:08] PROBLEM - Host tools-mail is DOWN: CRITICAL - Host Unreachable (10.68.16.27)
[17:19:02] !log tools restarting tools-mail, beginning do-release-upgrade -d -q
[17:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:22:25] RECOVERY - Host tools-mail is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[17:25:19] PROBLEM - Puppet run
on tools-mail is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:46:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:51:31] !log tools rebooting tools-mail post upgrade
[17:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:52:19] Labs, Tool-Labs, DBA: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965899 (Tb) Great thanks. although I missed one in the list above; can you grant all to s51111 on p50380g50491_inconsistent_redirects on s1.labsdb also please.
[18:00:40] andrewbogott: I just did a test of the tools-mail functionality and it seems ok :D
[18:00:56] !log tools apt-get autoremove on tools-mail
[18:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:01:08] yuvipanda: great, I'm going to do a bit more cleanup and reboot again before declaring victory
[18:01:21] andrewbogott: awesome, ok
[18:05:21] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[18:12:02] yuvipanda: there are like 25 old kernels on that box, apt is still cleaning
[18:12:12] andrewbogott: :D nice
[18:12:20] andrewbogott: we should clush a clean on all boxes sometime
[18:12:38] yeah, wouldn't hurt
[18:13:02] the kernel post-uninstall script seems to do a system-wide 'find' or something, it's super slow
[18:13:58] !log one last reboot of tools-mail
[18:13:58] Unknown project "one"
[18:14:03] !log tools one last reboot of tools-mail
[18:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:23:49] !log video killed the rest of tasks on encoding02 & encoding03, they are gonna fail anyways
[18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[18:27:53] !log video repooling encoding02 & encoding03, depooling encoding01 T155455
[18:27:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[18:27:56] T155455: Start using async chunked uploads - https://phabricator.wikimedia.org/T155455
[18:37:03] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[18:40:47] Labs: Clean up backups of tools-mail on labvirt1009 - https://phabricator.wikimedia.org/T156160#2966033 (Andrew)
[18:43:44] !log phabricator deleting phab-03 and phab-05 to allow us to create one large instance for replacement for phab-01 (might)
[18:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[18:43:50] !log phabricator please stop using phab-01,-03,-05, none of them use puppet. please use instance "phabricator", this actually has the prod role
[18:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[18:45:18] !log phabricator deleting phabricator instance and recreating it, replacing phab-01 with a phabricator instance :)
[18:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[19:06:47] hello! so it appears the crontab for all of my tools has been cleared!?
[19:07:43] yuvipanda bd808 ? was this part of the NFS maintenance? I seem to recall the jobs running fine after that happened
[19:08:54] Hi labs people :) Do you think someone could take care of that one before tomorrow (it would unblock me for tomorrow morning): https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Joal
[19:12:34] ^ madhuvishy, about the crontabs. Any ideas? :'(
[19:12:57] musikanimal: no that wasn't part of the maintenance
[19:13:42] well I wonder what happened. I assume this was not intentional
[19:14:12] no, tools was unaffected in the maintenance
[19:15:31] cool, well looking at my bot's contribs, I'm going to guess the crontabs were cleared some time in the past 8 hours
[19:15:42] which is not cool :'(
[19:15:53] oh - which bot is this?
[19:15:55] for some of the tools I don't have the configurations backed up
[19:15:56] tool*
[19:15:58] all of them
[19:16:06] musikbot, eranbot, xtools
[19:16:10] looking
[19:16:23] all of the crontabs are in the initial state, as if it were a fresh new tool
[19:16:33] that's.. weird
[19:18:13] I know you can run sudo something something to see when the crontabs were last edited, and perhaps by whom
[19:19:02] sudo ls -l /var/spool/cron/crontabs/$USER
[19:19:19] tools.yifeibot@tools-bastion-02:~$ ssh tools-cron-01
[19:19:20] Connection closed by 10.68.23.89
[19:20:35] tools-cron-01 malfunction?
[19:21:07] I hope so! that is better than them being completely cleared, assuming we can get it back up and running
[19:21:12] actually labs team, sorry to have bothered, looks like I was on the wrong path - nevermind my previous request
[19:22:26] zhuyifei1999_: I was able to SSH in using my normal bastion user, without "become"ing a tool
[19:23:39] hmm
[19:24:17] i can login to tools-cron-01
[19:24:21] musikanimal: time to back up your crontab :P
[19:24:43] *was* time, too late now :(
[19:25:08] you can still `become` in tools-cron-01
[19:25:39] oh nice!!!
[19:25:43] okay, backing up now!
[19:26:03] basically all your crontabs are saved/edited/executed on tools-cron-01
[19:26:56] idk why ssh within service accounts are broken
[19:27:27] musikanimal: yeah i can see all the crons are still intact there
[19:27:32] but something is broken
[19:28:06] madhuvishy: is there a debug log for sshd?
[19:28:15] it doesn't seem like an auth failure
[19:29:01] last lines with ssh -vvv:
[19:29:10] debug2: userauth_hostbased: chost tools-bastion-02.tools.eqiad.wmflabs.
[19:29:10] debug3: ssh_msg_send: type 2 [19:29:11] debug1: permanently_drop_suid: 51201 [19:29:11] debug3: ssh_msg_recv entering [19:29:11] debug3: ssh_keysign: [child] pid=23881, exec /usr/lib/openssh/ssh-keysign [19:29:12] debug2: we sent a hostbased packet, wait for reply [19:29:12] Connection closed by 10.68.23.89 [19:36:05] musikanimal: i'm still not sure what's going on, but fyi we also backup crontabs in nfs (TIL) [19:36:24] good to know :) [19:37:07] I've just now backed all of mine up by SSH in to tools-cron-01 [19:37:39] https://www.irccloud.com/pastebin/beHb0Adc/ [19:40:03] I think I'll just file a ticket [19:41:54] zhuyifei1999_: thanks [19:42:07] also I forgot to change back my timezone on wiki, the last edit via cron by my bot was 40 minutes ago [19:42:18] so everything is probably still working [19:42:22] musikanimal: yeah it's all working [19:42:38] yay :) [19:42:40] not sure why your local crontab appears cleared [19:43:01] yeah, this is true for all the tools that I have access to [19:43:16] I guess if I need to modify them I'll have to go through tools-cron-01? [19:46:36] don't modify crons directly on the tools-cron hosts [19:46:39] musikanimal: ideally not [19:46:48] ok [19:47:00] we're looking right now [19:47:10] Labs, Tool-Labs: Tool labs crontab host tools-cron-01 cannot be ssh-ed into within a service user - https://phabricator.wikimedia.org/T156168#2966289 (zhuyifei1999) [19:54:11] Hi all, is it normal that I can't access to cron on tools ? [19:54:12] tools.framabot@tools-bastion-03:~$ crontab -l [19:54:13] Connection closed by 10.68.23.89 [19:54:46] framawiki: we are currently looking into it [19:54:52] framawiki: T156168 just filed [19:54:52] T156168: Tool labs crontab host tools-cron-01 cannot be ssh-ed into within a service user - https://phabricator.wikimedia.org/T156168 [19:57:06] Thanks for the link, do you know if crons are normally executed ? [19:57:39] "Crontabs are intact."
[19:58:53] framawiki: yes [19:59:34] musikanimal: zhuyifei1999_ does crontab -l as a tool user normally work from the bastion? (just verifying) [19:59:46] yes [20:00:00] k [20:04:19] joal: you appear to already be a member of the tools project; is there anything else you need? [20:04:29] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Joal was modified, changed by Andrew Bogott link https://wikitech.wikimedia.org/w/index.php?diff=1385324 edit summary: [20:08:25] Hi andrewbogott, sorry, made a mistake in procedure: I wanted to create a new tool [20:09:21] andrewbogott: I however have another question - I created the tool sqoop-tool, but it seems I don't have the replica.cnf file allowing db access :( [20:10:12] joal: ok, so, just to confirm… you already can/know how to create tools, correct? [20:10:34] andrewbogott: I followed the doc best I could :) [20:10:43] I don't know how replica.cnf works at all, so I defer to madhuvishy and/or bd808 [20:10:58] I now can become sqoop-tool on tools-bastion-03 [20:11:02] joal: wait 10 minutes [20:11:07] sure bd808 [20:11:08] joal: i'll look and fix in a bit [20:11:19] it should be faster than that but sometimes it's not [20:11:21] if it doesn't show up [20:11:27] wow, 3 people just me :) [20:11:30] Thanks guys ! [20:11:53] bd808: I created the tool like 2 or 3 hours ago [20:15:01] joal: the system that does that is currently frozen for puppet and needs some attention but I'm stuck elsewhere [20:15:07] if not there by tomorrow drop me a note [20:15:38] chasemp, madhuvishy, bd808, andrewbogott: Thank you all for the help, will check tomorrow and let you know !
[20:41:07] fix coming in for crontab editing issue - should be resolved soon [20:42:10] (CR) Andrew Bogott: [C: 2] Add link to OATH documentation on wikitech [labs/striker] - https://gerrit.wikimedia.org/r/333321 (owner: BryanDavis) [20:43:43] (Merged) jenkins-bot: Add link to OATH documentation on wikitech [labs/striker] - https://gerrit.wikimedia.org/r/333321 (owner: BryanDavis) [20:43:50] Labs, Tool-Labs: Rewrite /usr/local/bin/crontab in python; fix bugs - https://phabricator.wikimedia.org/T156174#2966501 (bd808) [20:59:37] Labs, Tool-Labs, Patch-For-Review: Tool labs crontab host tools-cron-01 cannot be ssh-ed into within a service user - https://phabricator.wikimedia.org/T156168#2966289 (chasemp) seems to work now from bastion-03 to tools-cron-01 using `crontab -e` as a test tool. Thanks for the report @zhuyifei1999... [20:59:43] Labs, Tool-Labs, Patch-For-Review: Tool labs crontab host tools-cron-01 cannot be ssh-ed into within a service user - https://phabricator.wikimedia.org/T156168#2966594 (chasemp) Open>Resolved a:chasemp [21:02:11] Tool-Labs-tools-LTA-Knowledgebase: Add option for tool admins to change write access - https://phabricator.wikimedia.org/T156181#2966635 (DatGuy) [21:02:20] joal: you should have a replica.my.cnf now [21:30:24] !log video git pulling v2c code to 8cf788e across all encoding hosts [21:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL [21:33:04] !log video depooling encoding02 & 03, repooling encoding01 [21:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL [21:37:00] Labs, Horizon, Operations, Puppet: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2966807 (Andrew) - I will double-check the caching, although I'm pretty sure I verified that the cache was working previously. - I'm currently experimenting with the next rev of Hor...
[21:52:39] Labs, Tool-Labs, Patch-For-Review: Tool labs crontab host tools-cron-01 cannot be ssh-ed into within a service user - https://phabricator.wikimedia.org/T156168#2966858 (MusikAnimal) Thanks for the quick fix! [22:01:18] (PS1) BryanDavis: Bump striker submodule [labs/striker/deploy] - https://gerrit.wikimedia.org/r/333989 [22:02:08] (CR) BryanDavis: [C: 2] Bump striker submodule [labs/striker/deploy] - https://gerrit.wikimedia.org/r/333989 (owner: BryanDavis) [22:02:15] (Merged) jenkins-bot: Bump striker submodule [labs/striker/deploy] - https://gerrit.wikimedia.org/r/333989 (owner: BryanDavis) [22:09:42] Striker, User-bd808: Deploy Striker account creation and management workflow - https://phabricator.wikimedia.org/T156195#2966952 (bd808) [22:10:20] Striker, User-bd808: Deploy Striker account creation and management workflow - https://phabricator.wikimedia.org/T156195#2966972 (bd808) [22:25:33] Awesome ! Thanks yuvipanda :) [22:48:58] Striker, User-bd808: Deploy Striker account creation and management workflow - https://phabricator.wikimedia.org/T156195#2967121 (bd808) [23:02:33] Labs, Tool-Labs: Rewrite /usr/local/bin/crontab in python; fix bugs - https://phabricator.wikimedia.org/T156174#2966501 (scfc) It would be nice to split out the "template" bit of it; i. e., let Puppet write `@cron_host` to some file in `/etc` and make `/usr/local/bin/crontab` read from that instead of ch... [23:50:38] Striker, User-bd808: Deploy Striker account creation and management workflow - https://phabricator.wikimedia.org/T156195#2967338 (bd808) The prod secrets need to be appended to the existing `striker::uwsgi::secret_config` hiera hash and look something like: ``` striker::uwsgi::secret_config:...