[08:27:19] <addshore>	 hey elukey & joal & anyone else that is awake :D
[08:28:57] <joal>	 Hi addshore :)
[08:29:17] <addshore>	 So, I fully tested those 2 patches and they both work :)
[08:29:25] <elukey>	 o/
[08:29:41] <addshore>	 elukey: is there any chance you could review https://gerrit.wikimedia.org/r/#/c/302130/ for me? :)
[08:30:08] <addshore>	 joal: it took me a little while to get the regex right, but I eventually got it :)
[08:35:03] <elukey>	 addshore: le patch looks good but I am not familiar with the analytics-slave, can you tell me what does this patch do? Just to play the part of the paranoid ops
[08:35:06] <elukey>	 :D
[08:35:11] <mforns>	 hi team
[08:35:50] <addshore>	 elukey: so that patch just allows my scripts to choose between the slave / master so then the scripts can split between the two rather than always running read only queries against master
[08:36:15] <addshore>	 unless for some reason analytics-slave is a thing I should not be used! :O
[08:36:18] <addshore>	 *using
[08:37:16] <joal>	 elukey: Do you know what happened to analytics 1032/1045 ? Disk failurs?
[08:37:48] <elukey>	 joal: helloooo
[08:38:01] <joal>	 elukey: o/
[08:38:23] <mforns>	 joal, hi! do you have 5 mins to batcave on edit history?
[08:38:26] <elukey>	 I still need to investigate it, I checked with andrew over the weekend and it seems that 1045 had some trouble with a disk triggering a hw raid controller reset
[08:38:30] <elukey>	 I have no idea why
[08:38:40] <elukey>	 the logs are a bit messy and not really useful
[08:39:09] <joal>	 elukey: Thanks for the headsup :)
[08:39:16] <joal>	 mforns: o/ !
[08:39:25] <joal>	 mforns: On my way to the batcave !
[08:39:29] <mforns>	 joal, ok
[08:40:12] <elukey>	 addshore: yeah exactly, this is my doubt.. I am not sure if you have to switch or not
[08:40:18] <elukey>	 buut we can figure it out
[08:40:24] <elukey>	 I'll ask around today
[08:40:30] <addshore>	 cool! :)
[09:08:30] <joal>	 elukey: About cassandra and RAID10, I stopped loading ;)
[09:09:15] <elukey>	 I didn't get exactly from the community what we need to do :D
[09:09:29] <elukey>	 I'd like to discuss it with the team this evening what do you think?
[09:09:33] <elukey>	 then take the decision
[09:09:36] <joal>	 sounds good to me
[09:09:51] <elukey>	 I should be able to replace the cluster in one day
[09:09:56] <elukey>	 with raid10
[09:10:07] <joal>	 elukey: my feeling is that we're gonna go with raid10 and adapt functionality
[09:10:19] <joal>	 elukey: But we should ensure that with team
[09:10:25] <elukey>	 yep
[09:12:53] <elukey>	 addshore: afaiu what we call informally analytics-slave (db1047) also contains eventlogging data
[09:13:22] <elukey>	 meanwhile analytics-store (dbstore1002) "only" all s* shards (wikis, etc..
[09:13:43] <elukey>	 so it should be fine to query both
[09:13:55] <elukey>	 but I'd ask to ottomata later on during the day to confirm
[09:13:58] <elukey>	 ok?
[09:14:26] <addshore>	 yup, thats fine :)
[09:15:08] <addshore>	 basically I split the scripts to run all of the easy queries on the slave and then anything a bit more complex / requires writing to temp tables etc to the master
[09:15:53] <elukey>	 no idea what is the convention, but I am really interested
[09:15:59] <addshore>	 :D
[09:16:15] <addshore>	 I only realised there was a second server a few days ago when reading the mailing lists
[09:31:45] <grrrit-wm>	 (CR) Joal: "Still one syle-comment ,but overall looks good." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301657 (https://phabricator.wikimedia.org/T141525) (owner: Addshore)
[09:32:39] <grrrit-wm>	 (CR) Joal: "Waiting for nuria'sv approval, good for me :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/301661 (https://phabricator.wikimedia.org/T141525) (owner: Addshore)
[09:33:54] <joal>	 addshore: almost there on the scala one, oozie is ok for me.
[09:34:14] <joal>	 addshore: Still waiting for nuria since she had reviewed the previous patches :)
[09:34:51] <addshore>	 okay! :)
[10:05:44] <grrrit-wm>	 (PS8) Addshore: Create WikidataSpecialEntityDataMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301657 (https://phabricator.wikimedia.org/T141525)
[10:05:50] <grrrit-wm>	 (CR) Addshore: Create WikidataSpecialEntityDataMetrics (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301657 (https://phabricator.wikimedia.org/T141525) (owner: Addshore)
[11:54:27] <elukey>	 joal: https://phabricator.wikimedia.org/T141761
[11:56:49] <elukey>	 also from what I can see we have a lot of errors logged for high temp in dmesg
[11:56:58] <elukey>	 I'll try to ping again Chris
[12:33:48] <elukey>	 joal: I've also read https://lostechies.com/ryansvihla/2014/09/22/cassandra-auth-never-use-the-cassandra-user-in-production/, thanks for posting it.. One more reason to fix our current cassandra settings :/
[12:34:08] <elukey>	 not sure how to safely change users in production though, might need to research a bit
[13:14:48] <wikibugs>	 Analytics-Cluster, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2511466 (elukey) @Cmjohnson ping :)
[13:24:51] <wikibugs>	 Analytics, Editing-Analysis, Graphite, Performance-Team, VisualEditor: Statsv down, affects metrics from beacon/statsv (e.g. VisualEditor, mw-js-deprecate) - https://phabricator.wikimedia.org/T141054#2485441 (Gilles) Did statsv recover by itself from the DNS issue or did someone restart it?
[14:09:28] <elukey>	 I did some calculations and with RAID10 on aqs we should be able to tolerate 4 disks failures easily. The only caveat is that if we loose two disks in the root raid10 partition we could loose an entire node and hence two instances
[14:09:43] <elukey>	 so the 4 disks failure worst worst case scenario will not hold anymore
[14:10:02] <elukey>	 I mean, we won't tolerate in the worst worst case scenario 4 disks failure
[14:10:26] <elukey>	 there are also the other scenarios like one rack down (two instances dead) and one / two disks failures
[14:10:44] <elukey>	 that would mean for us trouble
[14:11:07] <elukey>	 but overall raid10 would guarantee a much stronger resiliency
[14:12:57] <elukey>	 also from the space perspective, we moved from 3 instances and a ~2.4TB space required for a year on per instance to 6 instances and ~1.2TB of space required for a year
[14:13:49] <elukey>	 so maybe simply adding threee new physical nodes, raising the cluster to 12 instances, would mean cutting the space requirements again in half
[14:14:24] <elukey>	 a year of page views cost is a bit tricky thoug
[14:14:27] <elukey>	 *though
[14:15:00] * elukey is talking nonsense with himself
[14:15:45] <joal>	 elukey: be carefull, I listen  ;-P
[14:15:54] <elukey>	 hahahahah
[14:15:57] <elukey>	 I knew it! :P
[14:16:00] <joal>	 :D
[14:16:04] <elukey>	 does it make sense?
[14:16:28] <elukey>	 I am reasoning out loud becuase this cluster is driving me a bit mad
[14:16:33] <joal>	 from a resiliency perspective yes, not sure I understand the space thing
[14:16:53] <elukey>	 probably I wrote non sense for real :)
[14:17:33] <elukey>	 so my thoughts are about how much space will a year of page view data be for each instance when we scale up the ring
[14:17:40] <elukey>	 3,6,12 instances etc..
[14:17:51] <elukey>	 from 3 to 6 it seems that we have cut it in half
[14:18:16] <elukey>	 so I was trying to understand if from 6 to 12 instances would do the same trick
[14:19:09] <elukey>	 (because the ring would be split among more instances allowing a better data partitioning)
[14:19:44] <joal>	 elukey: better data partitionning is just "more partitions" in that case, right?
[14:20:28] <elukey>	 if we get down to ~600GB per instance (even something more) we could store 3 years of page views with.. ~2TB of space per instance?
[14:20:45] <elukey>	 joal: yes yes
[14:25:55] <joal>	 Let's put it this way elukey: 3 years of daily pageviews ~ 7.5Tb - Replication factor = 3 --> Total space needed = 22.5Tb
[14:26:06] <joal>	 elukey: Then you can divide by the number of machines
[14:26:22] <joal>	 but elukey, this is leaving compaction needed space out of the equation
[14:28:33] <elukey>	 joal: yeah your view is much better, would lead to something like 2TB as I imagined with 12 instances.. but in case of raid10, we'd have more than 3TB overall space for each instance. So 1TB dedicated to compaction
[14:29:01] <elukey>	 but I agree, who knows how our dear cassandra will behave at that scale
[14:29:27] <joal>	 elukey: We need more experience, but it seems needed compaction space can go up to (current data size) * 2
[14:29:34] <elukey>	 you are counting 2.5TB for a year?
[14:29:52] <joal>	 elukey: I rounded the number you gave ;)
[14:30:04] <joal>	 elukey: I think a year is closer to 2Tb
[14:30:13] <joal>	 elukey: Which is better ;)
[14:33:28] <elukey>	 :)
[14:33:43] <elukey>	 okok now I get the other angle, I misread some numbers :)
[14:34:36] <elukey>	 anyhow, doubling those 18 months could mean doubling the cluster (best case scenario with compaction)
[14:37:03] <nuria_>	 elukey, joal: i think is premature to chnage anything about lowering resolution of data on pageview api
[14:37:13] <nuria_>	 *change
[14:38:14] <elukey>	 nuria_: morning! We were just speculating how to store 3 years of data with raid10
[14:38:14] <joal>	 nuria_: Have we said so?
[14:38:33] <nuria_>	 elukey: hola
[14:38:49] <nuria_>	 elukey: sorry did not even say morning
[14:39:02] <joal>	 hi nuria_ o/
[14:39:29] <nuria_>	 elukey: i just want to make sure we do not go dow the path of lowering resolution just yet, because i do not think we need to do that now
[14:40:57] <nuria_>	 el*go down
[14:42:15] <elukey>	 nono I was trying to figure out how big a cluster should be to store X amount of pageviews
[14:42:35] <elukey>	 just to know what to expect if we go for raid10
[14:43:07] <elukey>	 because theoretically with 18 months of retention and 12 months of data already created we'd need to think about what to do soon :)
[14:44:23] <elukey>	 I would be happy to have raid10 everywhere and double check in 4/5 months the usage of the new cluster, because I suspect that very few people will need data older than 18 months
[14:45:51] <elukey>	 and raid10 wouldn't be the silver bullet of course
[14:46:57] <elukey>	 my goal is to unblock joal to load data asap.. It would take me probably one day to reimage the cluster and bring it up to speed again, the discussion about data retention might take a lot more
[14:49:21] <nuria_>	 elukey: for sure, it will take a lot more
[14:50:22] <joal>	 nuria_: the only issue I foresee without defining in advance life duration for data is the difficulty we'll have if we need to wipe SOMe
[14:51:27] <joal>	 elukey: --6
[14:51:31] <joal>	 -^
[14:51:32] <joal>	 sorry
[14:52:07] <nuria_>	 joal: on that regard if the concern is technical (say cassandra cannot possibly scale) we would have no choice. If the concern is $$$ for cluster we are not incurring in huge costs here given our hardware budget.
[14:52:46] <joal>	 nuria_: cassandra can scale :)
[14:53:09] <nuria_>	 joal: For technical concerns it will be up to us to define data boundaries, for which we need testing we have not done as - as you mentioned earlier- we do not know what to expect beyond a certain size
[14:54:13] <nuria_>	 joal: Then, even if we had to triple cluster size i think we will be well within our budget
[14:55:39] <joal>	 nuria_: I'm not saying we shouldn't keep data - I'm saying we should evaluate the cost of keeping it in regard of the usage that will be made of it
[14:56:03] <joal>	 nuria_: as for technical issues, space management we'll manage on the fly, but data deletion will be a bit less easy
[14:56:38] <nuria_>	 joal: agreed, but let's do that after we have the newcluster in place , we'll evaluate cost versus usage
[14:56:53] <joal>	 sounds good nuria_ :)
[14:57:24] <joal>	 nuria_, elukey : I was just pointing the issue of data deletion (if deletion is needed, which might not)
[14:57:40] <joal>	 after that, it's all about data loading and perf tests :)
[14:59:25] <elukey>	 joal: you are right and I agree, deletion with raid10 will be required in 6 months afaiu
[14:59:28] <elukey>	 no?
[14:59:38] <elukey>	 it is not something so far in the future
[15:00:13] <wikibugs>	 Analytics-Cluster, Analytics-Kanban, Deployment-Systems, scap, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2511674 (thcipriani) >>! In T129151#2499656, @Ottomata wrote: > ​Hm, will this fix all of the permissions and ownership recursively?  It >...
[15:00:20] <joal>	 elukey: that's correct I think
[15:01:51] <joal>	 elukey: I summarize the problem: in order to delete a row, you need to acctually override it, meaning an actual load with special value for deletion
[15:02:02] <joal>	 elukey: not that easy, but completely feasible
[15:03:24] <wikibugs>	 Analytics-Cluster, Analytics-Kanban, Deployment-Systems, scap, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2511680 (elukey) Update: this task is blocked until the pwstore vault will be usable again to store the new keyholder pass (hope that it wi...
[15:07:33] <elukey>	 joal: sigh
[15:07:59] <elukey>	 feasible but a bit weird, why the hell cassandra does not offer delete?
[15:08:44] <elukey>	 mmm maybe I am not getting the override thing
[15:08:45] <elukey>	 ANYHOW
[15:09:18] <joal>	 elukey: In order to delete, you neeed to know WHAT (meaning, the key)
[15:09:47] <joal>	 elukey: cassandra doesn't provide scanning  as SQL does, you need to tell it at least the partition key
[15:10:05] <elukey>	 ahhh okok now it makes sense
[15:10:06] <elukey>	 thanks :)
[15:10:11] <joal>	 ;)
[15:10:17] <elukey>	 still weird but it makes sense
[15:18:56] <wikibugs>	 Analytics-Kanban: User history: Fix the oldUserName and newUserName in blocks/groups log events - https://phabricator.wikimedia.org/T141773#2511710 (mforns)
[15:23:33] <wikibugs>	 Analytics-Kanban: User history: Adapt the user history reconstruction to use scaling by clustering - https://phabricator.wikimedia.org/T141774#2511728 (mforns)
[15:25:17] <wikibugs>	 Analytics-Kanban: User history: Adapt the user history reconstruction to use scaling by clustering - https://phabricator.wikimedia.org/T141774#2511751 (Nuria) I think this task might be a duplicate , adding it to main wikistats 2.0 task
[15:25:50] <wikibugs>	 Analytics-Kanban: User history: Adapt the user history reconstruction to use scaling by clustering - https://phabricator.wikimedia.org/T141774#2511767 (Nuria)
[15:25:52] <wikibugs>	 Analytics-Kanban: Wikistats 2.0. Edit Reports: Setting up a pipeline to source Historical Edit Data into hdfs {lama} - https://phabricator.wikimedia.org/T130256#2511766 (Nuria)
[15:28:04] <joal>	 hey mforns
[15:32:27] <elukey>	 joal: one weird thing in https://grafana.wikimedia.org/dashboard/db/aqs-elukey is that from the 27th (when we made the auth caching change) we started also to throttle internal traffic.. so it seems that we are allowing more in hitting the thresholds
[15:34:16] <joal>	 elukey: not sure I understand
[15:34:40] <joal>	 elukey: when looking at 30 days of data, you'll see throtlling for interenal has happenned before as well
[15:39:03] <elukey>	 yeah but it seemed super strange to me that we the throttling started at the same time
[15:39:18] <elukey>	 anyway, was just a curiosity
[15:43:19] <joal>	 milimetric: o/
[15:54:08] <wikibugs>	 Analytics-Kanban: Use scalable algo on enwiki - https://phabricator.wikimedia.org/T141778#2511831 (JAllemandou)
[15:56:12] <mforns>	 joal, hi!
[15:56:16] <joal>	 heya :)
[15:56:21] <mforns>	 want to chat 4 minutes pre standup?
[15:56:31] <joal>	 mforns: it'll wait post standup :)
[15:56:39] <mforns>	 ok
[15:57:26] <milimetric>	 hey joal
[15:57:32] <milimetric>	 I didn't get your ping... hm
[15:57:44] <joal>	 no problemo :)
[15:57:54] <joal>	 milimetric: we'll talk after standup with mforns :)
[16:00:00] <ottomata>	 milimetric:  btw, woohoo!  dude fixed the perfomance degradation for sync producer
[16:00:00] <ottomata>	 https://github.com/dpkp/kafka-python/pull/783
[16:00:27] <ottomata>	 can do the shortcut thing for now, and don't have to figure out el future stuff yet
[16:00:35] <ottomata>	 although I might anyway... just for this... dunno :)
[16:35:46] <wikibugs>	 Analytics, Research Ideas: Wikipedia main content losts sources because too reverts, try to preserve them - https://phabricator.wikimedia.org/T141177#2489399 (Milimetric) Untagging Analytics - we're an infrastructure team and this looks like a research topic.  The Snuggle project might be of interest.
[16:41:19] <wikibugs>	 Analytics, Editing-Analysis, Graphite, Performance-Team, VisualEditor: Statsv down, affects metrics from beacon/statsv (e.g. VisualEditor, mw-js-deprecate) - https://phabricator.wikimedia.org/T141054#2485441 (Milimetric) Open>Resolved a:Milimetric This seems fixed, if anyone did r...
[16:43:06] <wikibugs>	 Analytics: Adding top counts for wiki projects (ex: WikiProject:Medicine) to pageview API - https://phabricator.wikimedia.org/T141010#2512097 (Milimetric)
[16:44:26] <wikibugs>	 Analytics, Analytics-Wikistats: Design new UI for Wikistats 2.0 - https://phabricator.wikimedia.org/T140000#2512116 (Milimetric) p:Triage>Normal
[16:44:31] <wikibugs>	 Analytics: Adding top counts for wiki projects (ex: WikiProject:Medicine) to pageview API - https://phabricator.wikimedia.org/T141010#2512118 (Nuria) We need to tag pages with WikiProject in pageview hourly so that info is available when we load data into pageview API
[16:45:59] <wikibugs>	 Analytics, Research-and-Data-Backlog: Improve bot identification at scale - https://phabricator.wikimedia.org/T138207#2512122 (Milimetric) p:Triage>Normal
[16:52:25] <wikibugs>	 Analytics, Easy: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2512169 (Milimetric) p:Triage>Normal
[16:53:21] <wikibugs>	 Analytics, Easy: Add monthly request stats per article title to pageview api - https://phabricator.wikimedia.org/T139934#2512178 (Nuria) Careful, the unique devices endpoint should not be affected by these changes.
[16:54:19] <wikibugs>	 Analytics: User History: Populate the causedByUserId and causedByUserName fields in 'create' states. - https://phabricator.wikimedia.org/T139761#2512181 (Milimetric) p:Triage>Normal
[16:55:01] <wikibugs>	 Analytics: User History: Create documentation for the user history page - https://phabricator.wikimedia.org/T139763#2512183 (Milimetric) p:Triage>Normal
[16:55:05] <wikibugs>	 Analytics: User History: Add history of annonymous users? - https://phabricator.wikimedia.org/T139760#2512184 (Milimetric) p:Triage>Normal
[16:55:18] <wikibugs>	 Analytics, Analytics-Dashiki: Timeseries on browser reports broken when going back 18 months - https://phabricator.wikimedia.org/T141166#2512186 (Milimetric) p:Triage>Normal
[16:55:25] <wikibugs>	 Analytics, Analytics-Dashiki: Default date selection to currently applied date for browser reports - https://phabricator.wikimedia.org/T141165#2512189 (Milimetric) p:Triage>Normal
[16:58:57] <elukey>	 a-team: anybody to discuss aqs retention and next steps?
[16:59:04] <elukey>	 I just finished the ops meeting
[16:59:16] <addshore>	 oooh, nuria_ , when you get time could you please look at https://gerrit.wikimedia.org/r/#/c/301657/ and https://gerrit.wikimedia.org/r/#/c/301661/ ?
[16:59:42] <nuria_>	 addshore: sure!
[17:00:04] <addshore>	 thanks! I have tested both and they should be good to go! joal also reviewed them this morning!
[17:01:13] <elukey>	 ottomata: helloooooo! addshore asked me to review https://gerrit.wikimedia.org/r/#/c/302130/ but I have no idea of the analytics-store/slave use cases
[17:01:21] <nuria_>	 elukey: we are not going to change data retention for now , just created aticket to look into that as soon as we have our newer cluster in service: https://phabricator.wikimedia.org/T141789
[17:01:22] <elukey>	 I saw that one is holding eventlogging data
[17:02:01] <nuria_>	 elukey: makes sense? or were you thinking otherwise?
[17:02:36] <elukey>	 nuria_: makes sense, I just wanted to discuss if I could go ahead with the cluster reimage (to deploy raid10 everywhere) or not :)
[17:02:45] <nuria_>	 elukey: please do
[17:03:03] <elukey>	 nuria_: super, I'll dedicate tomorrow to it
[17:03:12] <nuria_>	 elukey: if we are to lower data resolution it will happen after, we also will likely need to add more nodes regardless
[17:03:41] <nuria_>	 elukey: this is so we split the project in phases
[17:03:58] <nuria_>	 elukey: we will look at lowering resolution as our next task if pertains
[17:04:34] <elukey>	 nuria_: sure, I just wanted to make clear that if I reimage the cluster with raid10 then we'll run out of space probably in 6 months
[17:04:57] <elukey>	 if everybody is onboard I am happy, and I agree that we can choose later on
[17:05:00] <elukey>	 not now
[17:05:01] <nuria_>	 elukey: right, which means that as soon as we have new cluster in service we probably want to provision more nodes
[17:05:20] <nuria_>	 elukey: and, at the same time, look at capacity and decide whether we need to lower resolution
[17:05:33] <ottomata>	 got it elukey
[17:05:33] <elukey>	 sure sure, I am completely ok
[17:05:48] <elukey>	 just wanted to finally complete the cluster to let joal start loading data
[17:06:43] <elukey>	 ottomata: if you have time I'd ask you for a --verbose
[17:06:49] <elukey>	 so I'll know next time
[17:06:54] <elukey>	 but even tomorrow
[17:07:10] <elukey>	 ah I saw the ping in ops
[17:07:11] <elukey>	 reading
[17:07:12] <elukey>	 :)
[17:07:59] * elukey will ping mforns
[17:08:55] <ottomata>	 ah yup
[17:09:13] <ottomata>	 elukey:  yeah, uh, my understanding is: there are some databases out there for research type folks to use
[17:09:19] <ottomata>	 the big one is analytics-slave, and it has all dbs
[17:09:27] <ottomata>	 but there are more, and I have n ever kept track of them
[17:09:50] <elukey>	 ottomata: okok if you don't know I am ok, I feel less ignorant :)
[17:10:11] <addshore>	 :D
[17:11:19] <grrrit-wm>	 (CR) Addshore: [C: 2] +Script to track echo mention status usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/302125 (https://phabricator.wikimedia.org/T140928) (owner: Addshore)
[17:11:22] <grrrit-wm>	 (CR) Addshore: [C: 2] +Script to track echo mention status usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/302127 (https://phabricator.wikimedia.org/T140928) (owner: Addshore)
[17:11:29] <grrrit-wm>	 (Merged) jenkins-bot: +Script to track echo mention status usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/302125 (https://phabricator.wikimedia.org/T140928) (owner: Addshore)
[17:11:32] <grrrit-wm>	 (Merged) jenkins-bot: +Script to track echo mention status usage [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/302127 (https://phabricator.wikimedia.org/T140928) (owner: Addshore)
[17:11:38] <elukey>	 all right team going afk!
[17:11:47] <elukey>	 have a good day/evening :)
[17:11:48] <elukey>	 o/
[17:13:28] <ottomata>	 laters!
[17:15:24] <nuria_>	 milimetric: let me know when you have time to look at asq deploy, will try to find your patch
[17:17:40] <milimetric>	 nuria_: oops, the patch must still be local, I don't see it in gerrit
[17:17:51] <nuria_>	 milimetric: ah, ok, could not find it either
[17:18:41] <nuria_>	 milimetric: ok, let me know when it is in gerrit
[17:18:44] <milimetric>	 nuria_: ... wtf, it didn't even create the patch at all actually
[17:18:52] <milimetric>	 it said it did when I ran that docker thing
[17:18:58] <nuria_>	 milimetric: did the build failed?
[17:19:01] <milimetric>	 no
[17:19:10] <milimetric>	 ok, I'm gonna get lunch :)
[17:19:33] <nuria_>	 milimetric: k, let me know when you are back
[17:19:38] <milimetric>	 I've gotta look at this unserialize bug too, I think that's higher priority, right?
[17:19:57] <milimetric>	 but if I'm not done by end of day, I'll switch back to this
[17:24:29] <ottomata>	 nuria_:  oook, here we go: https://gerrit.wikimedia.org/r/#/c/292049/
[17:24:50] <ottomata>	 nuria_:  that one is more urgent than the schema name change one, although it is alittle bigger
[17:25:00] <ottomata>	 i'd like to do a separate deploy for the schema name change
[17:25:12] <ottomata>	 since it requires a little bit of coordination.
[17:28:34] <nuria_>	 ottomata: ok, this one basically chnges the background library and adds the serialization function that -given encoding and python- might blow up and it's really nobody's fault
[17:32:53] <nuria_>	 ottomata: is there a puppet patch that goes with this one or how do you instantiate the kafka-python?
[17:38:59] <ottomata>	 nuria_:  for producer, no, because we are only upgrading the producer version to use the newer API
[17:39:01] <ottomata>	 it is hte same lib
[17:39:09] <ottomata>	 so producer URIs should remain the same, starting with kafka://
[17:39:28] <ottomata>	 hmm, actually, right
[17:39:39] <ottomata>	 this is al ittle more sensitive, since it does actually upgrade the producer to the new api
[17:39:44] <ottomata>	 so the new .deb version will need to be installed
[17:39:55] <ottomata>	 anyway, for consumer, there will need to be a puppet patch to switch to it
[17:40:13] <ottomata>	 but we don't want to switch yet, we just want to try it out in beta, and maybe in some analytics prod circumstances
[17:40:20] <ottomata>	 and we can do so conditionally, til we decide which one we like
[17:40:43] <ottomata>	 nuria_:  but for serialization, this doesn't actually change anything
[17:40:49] <ottomata>	 with the way things are being serialized
[17:40:58] <ottomata>	 but, kafka-python lets you specify a serialize function to the producer
[17:41:04] <ottomata>	 so it will auto serialize after you call produce()
[17:41:14] <ottomata>	 or uh, producer.send()
[17:41:15] <ottomata>	 or whatever it is
[17:41:27] <ottomata>	 instead of serializing ahead of time and then passing the bytes to the produce call
[17:45:42] <nuria_>	 ottomata: can we test this in beta?
[17:54:32] <nuria_>	 ottomata: in my vagrant after the pip install . eventbus does not start
[17:57:09] <nuria_>	 ottomata: the error (if I try to install it by hand)
[17:57:11] <nuria_>	 ottomata:
[17:57:14] <nuria_>	 https://www.irccloud.com/pastebin/AxbNDwUl/
[17:57:49] <nuria_>	 ottomata: ah wait , i ahem.... need to build
[18:07:34] <nuria_>	 ottomata: let me know when you are back
[18:18:12] <nuria_>	 ottomata: and ... can you start the service with upstart on vagrant?
[18:21:14] <ottomata>	 nuria_:  ja you need new version
[18:21:19] <ottomata>	 did you install the new version?
[18:21:23] <ottomata>	 probably
[18:21:35] <ottomata>	 if your venv is active
[18:21:44] <ottomata>	 pip install -e --upgrade .
[18:21:54] <ottomata>	 sorry , was chatting with toby
[18:21:58] <ottomata>	 brb too, need lunch, but i'm around
[18:23:30] <nuria_>	 ottomata: k, let's talk when you are back
[18:23:57] <nuria_>	 ottomata: i did install newest version of kafka-python
[18:24:11] <nuria_>	 ottomata: I am so confused between pykafka/kafka-python and aconfluent kafka
[18:24:15] <nuria_>	 ..
[18:24:21] <ottomata>	 haha
[18:24:30] <ottomata>	 nuria_:  lets chat real quick before i go so you can play
[18:24:33] <ottomata>	 batcave?
[18:24:38] <nuria_>	 k
[19:39:06] <wikibugs>	 Analytics-Wikistats, Operations, Patch-For-Review, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2512886 (Krinkle)
[19:44:31] <wikibugs>	 Analytics, Operations: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2512900 (Dzahn) investigated a bit.  could confirm outgoing packets from stat1004 towards carbon (the aggregator for eqiad).. could NOT confirm incoming packets on carbon (unlike from stat1003 and other...
[19:48:51] <wikibugs>	 Analytics, Operations: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2512910 (Dzahn) i don't understand how analytics roles are setup. "role::analytics_cluster::client" includes a bunch of other things and the word "firewall" or "base::firewall" does not show up in any o...
[19:50:02] <wikibugs>	 Analytics, Operations: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2512911 (Dzahn) a:Dzahn>None
[19:54:58] <wikibugs>	 Analytics-Wikistats, Operations, Patch-For-Review, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2512943 (Krinkle) Until 4 months ago, this redirect existed. {0f5815e9b6} - https://gerrit.wikimedia.org/...
[19:55:05] <wikibugs>	 Analytics-Wikistats, Operations, Patch-For-Review, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2512944 (Krinkle) declined>Open
[19:55:22] <wikibugs>	 Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (Krinkle)
[20:21:26] <wikibugs>	 Analytics, Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2512993 (Dzahn)
[20:23:18] <wikibugs>	 Analytics, Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (BBlack) I tend to agree that if this was linked externally, we shouldn't have broken it.  I don't think...
[20:29:03] <wikibugs>	 Analytics, Operations: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2513016 (Dzahn) Open>Resolved a:Dzahn eh.. yea.. after looking more, i restarted all aggregators on carbon (as in "kill" them and run puppet)  stat1004 showed up   https://ganglia.wikimedia....
[20:40:31] <wikibugs>	 Analytics, Pageviews-API: Pageview API Capacity Projections when it comes to storage - https://phabricator.wikimedia.org/T141789#2513028 (Danny_B)
[20:59:01] <milimetric>	 nuria_: you wanna take another look at the deploy?
[22:11:12] <ottomata>	 milimetric:  you still working?
[22:12:41] <milimetric>	 uh... yea
[22:13:00] <milimetric>	 what's up
[22:13:02] <milimetric>	 (I've just gotta go in like 20)
[22:13:38] <ottomata>	 hmm, nothing big, think i've got this async thing working, watned to run through it with you to see if it made sense
[22:13:55] <milimetric>	 sure, to the cave
[22:14:06] <ottomata>	 k!
[23:29:44] <wikibugs>	 Analytics-Wikimetrics, Continuous-Integration-Config: tox runs all tests (including manual ones) - https://phabricator.wikimedia.org/T71183#2513693 (greg) p:Low>Lowest >>! In T71183#2492178, @Nuria wrote: > @harshar: tests cannot be run from depo alone as they require a wikimetrics instance runni...