[01:27:18] PROBLEM - HDFS capacity used percentage on analytics1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [90.0]
[01:30:18] ACKNOWLEDGEMENT - HDFS capacity used percentage on analytics1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [90.0] ottomata Will be able to delete data soon.
[07:42:29] morning!
[07:42:34] * elukey coffee before starting
[08:18:56] Heya team
[08:23:21] elukey: we reached 90% of hdfs usage?
[08:31:18] joal: o/ it has happened two times in a row for the past two days
[08:31:21] but andrew acked it
[08:31:38] didn't check the status but IIRC he wanted to wait a bit
[08:40:00] elukey: yes he wanted to wait -- I think we should be careful on how fast we fill-in now - I'd rather not fail to import webrequest or something like that
[08:45:15] yep, but we should still have 200T right? So we are good for at least 2/3 days?
[08:45:55] we have 157TB - We should be safe for a few days, but I'd really like to be safer :)
[08:50:20] oh yes me too, but I expect to have moar free space by EOD after standup
[10:43:14] has anybody already submitted the all hands travel form? I have a doubt
[11:15:49] elukey: not me :P
[11:16:09] elukey: do you mind trying to deploy AQS modules-upgrade to beta?
[11:16:20] elukey: do you mind ME deploying sorry
[11:16:57] nope :)
[11:17:05] ok, will do that then
[11:19:12] (CR) Joal: [V: 2 C: 2] "Merging for deploy on beta." [analytics/aqs] - https://gerrit.wikimedia.org/r/384590 (https://phabricator.wikimedia.org/T178312) (owner: Joal)
[11:39:59] (PS1) Joal: Update aqs to c1edede [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/385174
[11:43:42] (CR) Joal: [V: 2 C: 2] "Merging for deploy in beta." [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/385174 (owner: Joal)
[11:44:22] !log deploying AQS in b
[11:44:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:26] Again ...
[11:44:29] !log deploying AQS in beta
[11:44:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:47:41] elukey: I can't recall the hostnames of aqs in beta - a hint for me?
[11:47:55] deployment-aqs03.deployment-prep.eqiad.wmflabs
[11:47:58] 01 02 03
[11:47:58] :)
[11:48:03] Thanks mate :)
[11:48:15] This tells about how often I use beta ....
[12:13:05] elukey: I think my deploy will alert on labs because of monitoring of an endpoint with no druid
[12:17:27] joal: what do you mean? We don't have alerting in labs.. do you mean that aqs will do something weird?
[12:18:18] elukey: in prod-aqs, we have per-endpoint monitoring, doing regular queries and expecting results
[12:18:34] elukey: I have this setup for one endpoint using druid backend
[12:18:37] sure but it should be disabled in labs
[12:18:47] elukey: That's great to hear :)
[12:18:57] elukey: deployment failed nonetheless :(
[12:21:07] I was about to go to lunch but let me know if there is anything that I can do
[12:21:24] elukey: I'm trying to debug, but not easy
[12:21:39] elukey: please go to lunch, I'll try to manage, and cry until you get back ;)
[12:22:29] OSError: [Errno 107] Port 7232 not up within 120.00s ouch
[12:22:51] all right should be away for 30/40 mins, after that I'll help if still needed joal !
[12:22:54] * elukey lunch
[12:57:52] elukey: I'm fixing our repository setup so that starting from scratch we no longer have thirdparty, but instead thirdparty/hwraid for all physical servers and then specific includes within the roles (e.g. the CI hosts are including thirdparty/ci for jenkins
[12:58:14] confluent-kafka-2.11 is in thirdparty and needs to be moved to a separate component
[12:59:33] any proposal, e.g. thirdparty/kafka or maybe thirdparty/confluent?
[12:59:53] the idea is that all those external debs are limited to the hosts which specifically need them
[13:00:33] we're already doing the same for thirdparty/cloudera in jessie for the hadoop hosts
[13:02:47] moritzm: I'd say thirdparty/confluent but let's wait ottomata since he is passionate about names and I usually pick up the wrong ones :D
[13:03:02] joal: any luck?
[13:07:40] ok :-)
[13:12:27] ottomata: good morning! let me copy some backscroll:
[13:12:33] I'm fixing our repository setup so that starting from scratch we no longer have thirdparty, but instead thirdparty/hwraid for all physical servers and then specific includes within the roles (e.g. the CI hosts are including thirdparty/ci for jenkins
[13:12:36] confluent-kafka-2.11 is in thirdparty and needs to be moved to a separate component
[13:12:40] any proposal, e.g. thirdparty/kafka or maybe thirdparty/confluent?
[13:12:43] the idea is that all those external debs are limited to the hosts which specifically need them
[13:12:47] we're already doing the same for thirdparty/cloudera in jessie for the hadoop hosts
[13:12:56] moritzm: I'd say thirdparty/confluent but let's wait ottomata since he is passionate about names and I usually pick up the wrong ones :D
[13:14:12] there's also two packages in stretch-wikimedia/thirdparty, which should not be in thirdparty: prometheus-jmx-exporter and jmxtrans
[13:14:40] I'll move these to main, starting with stretch thirdparty should only be used for packages we sync from external repositories
[13:15:02] moritzm: +1 to all proposals above :)
[13:15:40] which one? :-) thirdparty/confluent or thirdparty/kafka?
[13:16:31] oh confluent
[13:16:45] k
[13:16:49] there may be other confluent debs we add one day
[13:30:59] ottomata: o/ I found a way to fix the jumbo metrics
[13:31:03] it should be ok now
[13:34:48] jumbo metrics?
[13:35:18] after the merge they disappeared yesterday
[13:35:43] https://gerrit.wikimedia.org/r/#/c/385153/
[13:36:15] (I reverted it since I thought it wasn't working but it was a race condition with the exported resources)
[13:36:23] (just rolled it out now, all good)
[13:36:40] OH!
[13:38:23] elukey, ottomata: I'm going to try to take 20' next week to present what I've been doing around our Maven builds... should I send you an invite? Any preferred time?
[13:38:57] I am totally ignorant but I'd like to participate!
[13:39:28] when would it be good for you and your audience to schedule the presentation?
[13:39:45] (do you need us people etc..)
[13:40:53] Ahh, elukey previously the title was expected to be the hostname?
[13:41:04] gehel def add joal
[13:41:10] maybe next week sometime?
[13:41:19] probably around this time is good, 10am something?
[13:41:27] euro folks are up, i'm up, it's before meetings start
[13:41:29] 10am EST
[13:41:41] 14:00 UTC?
[13:43:53] ottomata: yeah :(
[13:44:30] confused though
[13:44:34] how does this cassandra piece work
[13:44:45] if it uses hostname as ${::hostname}-${instance_name} ?
[13:46:36] (PS1) Fdans: Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186
[13:47:04] ottomata: all the cassandra instances have a dedicated domain, other than the one of the host on which they run
[13:47:17] dedicated domain?
[13:47:21] (PS2) Fdans: Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186 (https://phabricator.wikimedia.org/T178461)
[13:47:29] for example, we have aqs1004.eqiad.wmnet, aqs1004-a.eqiad.wmnet, aqs1004-b.eqiad.wmnet
[13:47:31] oh, you mean that is a real dns name?
[13:47:35] yeah!
[13:47:39] ah cool
[13:47:46] elukey: maybe in the define you can do
[13:47:48] $hostname = $title
[13:47:53] and make it the default, but not required
[13:48:47] I preferred to make it explicit to be clear and avoid mistakes but we can do anything
[13:48:55] hmmm, yeah maybe you are right. actually yeah
[13:49:04] because in many cases the title won't be the hostname
[13:49:06] for jmx exporter
[13:49:11] it'll be some jvm name
[13:49:13] ok
[13:49:20] cool, thanks elukey!
[13:49:38] elukey: did you see this one yet?
[13:49:38] https://gerrit.wikimedia.org/r/#/c/384586/
[13:49:47] i think i need some help on how to test jmx exporter
[13:50:49] ah yes it is in my todo list! I'll try to do it today!
[13:53:27] the kafka dashboard is still showing no metrics https://grafana-admin.wikimedia.org/dashboard/db/prometheus-kafka
[13:53:30] ufff
[13:57:03] elukey, ottomata: invite sent, a bit later (9am SF time) to account for my team member in that timezone. But I'm happy to do another session for you if that one does not work.
[14:00:09] oook fixed!
[14:00:13] thanks gehel !
[14:27:23]
[14:27:39] elukey: no luck
[14:28:19] elukey: I think it comes from cassandra config
[14:29:14] joal: lemme check the config.yaml, iirc it was changed a bit during the last fix so we might need to tweak it
[14:29:41] elukey: use aqs01 or 02, I have manually changed the one on 03 (after disabling puppet)
[14:29:56] * joal hopes elukey won't take offence
[14:32:44] I saw it :)
[14:32:58] * joal knew elukey knew
[14:33:00] so I can't find what I meant in puppet.log (weird)
[14:34:32] :s
[14:37:17] {"name":"aqs","hostname":"deployment-aqs01","pid":11,"level":60,"err":{"innerErrors":{"10.68.18.237:9042":{"name":"AuthenticationError","stack":"Error: Username and/or password are incorrect\n
[14:37:21] ahhhh
[14:37:23] joal: --^
[14:37:40] elukey: I knew that one
[14:37:48] elukey: sorry, I thought you knew ;)
[14:37:52] :D
[14:38:10] no just started checking it
[14:38:17] should we fix the auth then?
[14:38:22] ok - so it seems to be indeed a user/password error
[14:38:35] I don't know what cassandra expects u
[14:38:41] in deployment-prep
[14:42:15] joal, elukey I think there's issues in deployment of refinery... the refinery repo in stat1005 is in error status
[14:42:32] mforns: --verbose
[14:42:33] :D
[14:42:37] hehe
[14:43:46] elukey, https://pastebin.com/P28HFP00
[14:44:16] seems some git files have wrong permissions?
[14:45:59] elukey, stat1004 seems ok
[14:46:31] I just did scap deploy before that
[14:48:50] did it succeed?
[14:49:06] it says OSError: [Errno 13] Permission denied: '.git/fat/objects/tmpo_bxnZ'
[14:49:09] that doesn't exist
[14:49:34] elukey, yes it succeeded
[14:50:38] joal: breakdowns work now in new metrics and I'm so damn thrilled :D :D :D
[14:50:57] :D
[14:51:40] mforns: so you deployed to all the hosts right?
[14:51:59] (deploy msg next time so ops will be happy :)
[14:52:15] fdans: Demo at standup please :)
[14:52:59] joal it would be soooo cool to have this kind of graph broken down for bytes removed and bytes added
[14:53:08] https://usercontent.irccloud-cdn.com/file/sVlZT1hl/Screen%20Shot%202017-10-19%20at%2016.52.30.png
[14:53:11] elukey, oh, will do
[14:55:22] fdans: you mean having 2 new metrics, one for bytes with +only and bytes with - only, right?
[14:56:58] couldn't it be a breakdown in absolute bytes diff?
[14:57:12] although the name of the metric wouldn't make a lot of sense
[14:57:34] I just thought it would be pretty insightful to have that visualised
[14:57:42] fdans: it makes sense
[14:57:49] mforns: it was in a detached head, so I checkout master and did a git reset --hard origin/master
[14:57:52] seems good now
[14:58:55] let me know
[14:59:24] ah weird an1003 is indeed in detached as well
[14:59:58] elukey, 1004 is in detached too, but not erroring on git status
[15:00:25] mforns: can you try to deploy only to stat1005? scap deploy --limit stat1005.eqiad.wmnet
[15:00:28] or I can do it
[15:00:38] elukey, sure!
[15:02:25] elukey, running
[15:03:01] elukey, done. looks good
[15:04:24] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3657517 (elukey) a:Cmjohnson>elukey
[15:05:05] super
[15:05:11] (deployment msg :P_
[15:08:07] Analytics, DBA, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3696790 (jcrespo)
[15:08:10] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3696789 (jcrespo)
[15:14:37] elukey, !!! depl mesg :[ sorry
[15:38:31] elukey: how do I test jmx exporter! i tried to do
[15:38:32] java -cp jmx_exporter.jar io.prometheus.jmx.JmxScraper service:jmx:rmi:your_url
[15:38:37] but i'm not sure if i was doing it right
[15:38:45] i want to see what metrics would be exported with my config
[15:38:49] i've got it running in labs on k2-1
[15:41:23] ottomata: I tested kafka's jmx exporter in labs after applying the agent, not running via java -cp :(
[15:41:56] how did you do that?
[15:42:48] simply curling port 7800/metrics
[15:43:31] (after adding the -javaagent etc.. parameter to the jvm)
[15:43:44] mirror maker will have to pick up a new port
[15:49:39] Analytics: Why do we allow "bot" in metrics/pageviews/per-article - https://phabricator.wikimedia.org/T178448#3692674 (fdans) There used to be a difference between bots and spiders, so we keep this for historical reasons. Let's aim for removing the category that has no data
[15:50:09] Analytics: Remove "bot" from metrics/pageviews/per-article - https://phabricator.wikimedia.org/T178448#3696861 (fdans)
[15:51:46] Analytics-Kanban, Analytics-Wikistats: Fonts from fonts.googleapis.com on wikistats - https://phabricator.wikimedia.org/T178317#3696865 (fdans)
[15:52:04] ah / metrics!
[15:52:46] ah elukey i was curling localhost
[15:52:58] it only binds to IP
[15:53:58] Analytics-Kanban, Patch-For-Review: Add link to footer of wikistats with "file a bug" - https://phabricator.wikimedia.org/T177642#3665482 (fdans)
[15:59:37] Analytics, DBA: Access to x1 broken on stat1006 - https://phabricator.wikimedia.org/T178237#3685563 (elukey) Yes we switched the CNAME on purpose a while ago (https://gerrit.wikimedia.org/r/#/c/378211/1/templates/wmnet) to avoid access from the research user to db1029. We could set up on dbstore1002 the...
[16:02:44] Analytics, DBA: Access to x1 broken on stat1006 - https://phabricator.wikimedia.org/T178237#3685563 (jcrespo) This is a duplicate. The problem is it is not easy to replicate x1 because it duplicates db names (e.g. enwiki and enwiki are both on s1 and x1). I promised to provide temporary access soon, if I...
[16:03:48] Analytics, DBA: Access to x1 broken on stat1006 - https://phabricator.wikimedia.org/T178237#3696908 (jcrespo)
[16:03:54] Analytics: Provide historical redirect flag in Data Lake edit data - https://phabricator.wikimedia.org/T161146#3122912 (fdans) We cannot do this without plain text parsing, so moving this to Q3
[16:15:38] Analytics, Analytics-Cluster: Enable base::firewall on stat boxes after restricting Spark REPL ports. - https://phabricator.wikimedia.org/T170826#3443826 (fdans) If there is a deterministic range of ports this should be easy but we need to research it first.
[16:24:06] Analytics-Kanban: Fix wikimedia-history revision-deleted data - https://phabricator.wikimedia.org/T178587#3696992 (JAllemandou)
[16:28:11] elukey: https://github.com/nodefluent/prometheus-kafka-connect :) basically statsv for prometheus?
[16:29:13] Analytics-Kanban: Fix EventLogging editCountBucket fields historically - https://phabricator.wikimedia.org/T169674#3697040 (fdans)
[16:30:28] ottomata: so we could publish druid's metrics on kafka and use the connector to stream them to prometheus for example?
[16:30:39] ya
[16:31:11] that seems really nice!
[16:35:31] team, I need to run an errand, will be back in 1 hour or so
[16:38:58] Analytics, Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697072 (Erik_Zachte)
[16:41:26] joal: aqs | data | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}
[16:41:36] is aqs able to do stuff on all the keyspaces?
[16:41:49] anyhow, will check tomorrow :)
[16:41:51] * elukey off!
[16:41:52] byyee
[16:44:04] Analytics, Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697113 (Erik_Zachte) I see //revision_is_deleted//, but how about //page_is_deleted//? My understanding was that everything was kept in the database forever, so deleted are mer...
My understanding was that everything was kept in de database forever, so deleted are mer... [17:10:37] 10Analytics, 10Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697231 (10Erik_Zachte) Building on the previous comment (about page deletions): I queried for a very small set of titles in one wiki in one namespace, so I could compare title fo... [17:20:21] 10Analytics, 10Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697260 (10Erik_Zachte) Question: with deleted revisions still somewhere in the database, as column //revision_is_deleted// suggests: should these be shielded from the public once... [17:22:46] 10Analytics, 10CirrusSearch, 10Discovery, 10Discovery-Search: Load cirrussearch data into druid - https://phabricator.wikimedia.org/T156037#3697267 (10debt) Moving this to later, as we just don't have time for this right now and Analytics is still working on their portion of this type of work. [17:27:12] 10Analytics, 10Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697317 (10Erik_Zachte) There are columns //event_user_is_bot_by_name// and //user_is_bot_by_name//, but not //event_user_is_bot// or //user_is_bot//. Wouldn't that make sense to h... [17:30:34] 10Analytics, 10Analytics-Wikistats: Feedback on hive table mediawiki_history by Erik Z - https://phabricator.wikimedia.org/T178591#3697328 (10Erik_Zachte) There is //page_is_redirect_latest//, I imagine it could be very useful to also have a field to which page id or page title the redirect goes. For example f... [18:02:03] 10Analytics, 10Analytics-Dashiki: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3697395 (10Nettrom) [18:25:15] baack [19:17:11] ottomata, yt? [19:18:37] ya [19:19:12] mforns: waasssssup? 
[19:21:32] elukey: ok i can see jvm metrics, but i have no idea if jmx exporter likes my config for kafka
[19:21:34] don't see anything in output
[19:50:08] ottomata, sorry no headphones.. do you know why stat1005's refinery repo is erroring on git status?
[19:53:47] no, but looking
[19:54:06] git fat not happy hm
[19:54:21] going to try some hammers
[19:55:17] mforns: dunno, i just did sudo -u analytics git reset --hard
[19:55:20] seems happier
[19:55:32] oh cool
[19:55:48] thx
[19:55:55] will try to continue deployment
[20:04:54] !log Deployed refinery using scap, then deployed onto hdfs
[20:04:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
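
The jmx exporter check discussed in the log (curl port 7800/metrics on the host's IP rather than localhost, then look for kafka metrics next to the jvm ones) could be sketched roughly as below. This is only an illustration under assumptions: the function names, the `kafka_` prefix, and the example address are hypothetical and do not come from the log; only port 7800 and the `/metrics` path do.

```python
from urllib.request import urlopen


def fetch_metrics(host, port=7800, timeout=5.0):
    """Fetch the Prometheus text exposition from http://host:port/metrics.

    Per the log, the exporter binds to the host IP, so pass that IP
    (or a name resolving to it) rather than "localhost".
    """
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition, prefix=""):
    """Return sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        parts = line.split("{", 1)[0].split()
        if not parts:
            continue
        if parts[0].startswith(prefix):
            names.add(parts[0])
    return sorted(names)


# Hypothetical usage (address and prefix are placeholders):
#   text = fetch_metrics("10.0.0.5")
#   print(metric_names(text, prefix="kafka_"))
```

If the second call returns an empty list while jvm metrics are present, that would match the symptom in the log: the agent is running, but the exporter config is not matching any kafka MBeans.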