[07:26:04] good mooooorning!
[08:30:57] o/
[08:51:12] * elukey sees data loss warnings emails
[08:51:14] * elukey cries
[08:52:09] fdans: hola.. if you feel opsy this morning we could check the data loss warning emails
[08:54:02] Hi lads
[08:54:41] elukey: Mwarf ... Data loss emails :(
[08:55:41] elukey: yessssir, batcave whenever you want
[08:55:45] helloooo joal!
[08:55:59] Morning fdans :)
[08:56:31] fdans: morning time, elukey will probably answer async while coffee gets into his blood ;)
[08:57:10] oh yes of course
[09:00:53] fdans: yes please, 30 mins and I'll be operative :D
[09:00:57] (reading emails etc..)
[09:01:13] * joal and elukey are an old morning couple
[09:01:27] but I thought it would have been a good "problem" to review since it touches basically most of our infrastructure :D
[09:01:31] hahahaah
[09:01:35] joal: <3
[09:53:06] fdans: if you want we can chat about the alarms in here, so joal will read and probably correct me :D
[09:53:29] sure elukey!
[09:53:42] do you have an idea about what's happening ?
[09:53:59] I mean, very generic statement of the problem :)
[09:54:45] no idea, except from an email you sent last week about presumably the same issue?
[09:54:55] *waves* it would be great if someone could have a look at https://gerrit.wikimedia.org/r/369902! :D
[09:54:57] (Puppet)
[09:55:01] it should yes, but we need to verify it since it might be something different
[09:55:24] right
[09:55:35] modules/statistics/manifests/wmde/wdcm.pp:25 wmf-style: class 'statistics::wmde::wdcm' includes r_lang from another module :(
[09:55:58] addshore: will check later on, is that fine??
[09:56:04] yup!
[09:56:19] so how do we verify elukey
[09:56:49] fdans: wait a sec, let's review what the main problem is here, then we can discuss a plan :)
[09:57:05] sounds good
[09:57:46] so I am going to write an overview of how things are flowing from varnish to kafka, stop me if this is boring or if you know it already
[09:58:53] Varnish logs all the requests landing on each caching host in shared memory, not in a file (since they don't want to waste time doing I/O with disks)
[09:58:58] not boring at all and I am deeply ignorant about it elukey :)
[09:59:27] and basically the architecture is load-balancers --> Varnish --> application servers (PHP/Mediawiki/Etc..)
[10:00:14] that makes sense
[10:00:16] now our dear Varnishkafka is a C daemon that runs on all the caching hosts (they are called cp[1234]*)
[10:01:07] it reads from shared memory, interprets the data, formats it into a json hash (we configure its format) and then it sends everything to kafka in batches
[10:01:39] then Camus pulls data from Kafka to HDFS periodically, but producer and consumer do not have to coordinate thanks to kafka
[10:01:44] it all happens asynchronously
[10:01:57] (feel free to ask questions anytime)
[10:02:27] * fdans is processing
[10:03:31] I'm not sure what part Camus plays here elukey
[10:03:48] it doesn't sorry, it was only to complete the picture
[10:04:18] okok
[10:04:41] not true actually, it plays a little part
[10:04:58] anyhow, at the Camus level we bucket data per hour
[10:05:08] to have hourly partitions on hdfs etc..
[10:05:19] but Camus afaik runs every 10 mins in a cron
[10:05:22] so very often
[10:05:30] now the alarm says
[10:05:43] "54 requests (0.0% of total) have incomplete records. 2123099 requests (1.565% of valid ones) were lost."
[10:05:48] (in the attached file)
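Aside - a hypothetical, heavily abridged example of the kind of json record varnishkafka produces for one request. The field names follow the webrequest conventions that come up later in this conversation; the values are invented:

    {
      "hostname": "cp1065.eqiad.wmnet",
      "sequence": 123456789,
      "dt": "2017-10-30T00:05:43",
      "uri_host": "upload.wikimedia.org",
      "uri_path": "/wikipedia/commons/...",
      "http_status": "200",
      "user_agent": "..."
    }

The sequence and dt fields are the ones that matter for the rest of the discussion.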
[10:06:11] that looks really horrible
[10:06:37] so "incomplete records" are a side effect of how Varnishkafka works
[10:06:57] just to be clear, I don't have that email right?
[10:07:14] you should, it went to analytics-alerts@
[10:07:22] maybe you are not on that mailing list
[10:07:45] I think I'm not 🙈
[10:07:57] ahhh you are not indeed
[10:08:02] let me add you
[10:08:13] I am going to forward the email to you
[10:08:24] awesome
[10:09:29] got em
[10:10:22] looking at the attached message elukey
[10:10:27] yeah that looks pretty bad
[10:12:04] those "valid requests", are they that hourly bucket?
[10:12:25] !log added Francisco to the analytics-alerts@ mailing list
[10:12:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:13:03] I mean, assuming the data loss is real, that message is saying that out of all the requests in an hour, we've lost 1.565% right?
[10:13:17] exactly
[10:13:49] varnishkafka (for each caching host) holds a counter that we call "sequence" in webrequest logs
[10:14:25] that is incremented when varnishkafka starts processing a request in the shared memory log
[10:15:30] we have checks that verify that in each hour, there are no "holes" between sequence numbers
[10:15:46] right
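Aside - a minimal sketch of the idea behind those per-hour checks. This is not the actual refinery job, which is more careful (for instance, sequence numbers reset when varnishkafka restarts), but the core comparison looks like this, assuming the wmf_raw.webrequest table and its partition keys:

    -- For one hour of one webrequest source, compare the rows actually
    -- present with the count that the sequence counter implies, per host.
    SELECT
      hostname,
      COUNT(*)                                       AS rows_present,
      MAX(sequence) - MIN(sequence) + 1              AS rows_expected,
      (MAX(sequence) - MIN(sequence) + 1) - COUNT(*) AS rows_missing
    FROM wmf_raw.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2017 AND month = 10 AND day = 30 AND hour = 0
    GROUP BY hostname;

A rows_missing far above zero for a host means either real loss or, as explained next, records bucketed into the wrong hour.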
[10:16:15] incomplete records are due to how varnishkafka reads from shared memory
[10:16:35] basically Varnish logs "tags" related to the request being processed in shm
[10:16:38] like
[10:16:41] BeginTimestamp: X
[10:16:45] URI: Y
[10:16:46] etc..
[10:17:14] and eventually, if all is good, it logs an "end timestamp"
[10:17:29] got it
[10:17:31] that is what we add to the 'dt' field of the webrequest log
[10:18:02] varnishkafka does not wait indefinitely for this end timestamp, so sometimes it times out
[10:18:25] and we get records with dt:'-' in the webrequest logs
[10:18:33] those are the "incomplete" ones
[10:18:54] now the tricky part
[10:18:57] all good up to now?
[10:19:00] ooooo right
[10:19:09] yes yes
[10:19:42] so we had a problem in the past due to those incomplete records, since they are sneaky
[10:20:08] say a request starts a couple of minutes before the end of the hour
[10:20:17] 18:58 for example
[10:20:37] ok
[10:20:52] and for some reason (client super slow, etc..) it causes a timeout later on
[10:21:06] there is the risk that it gets "bucketed" in the next hour
[10:21:17] because the dt field is '-'
[10:21:24] so camus tries to do its best
[10:21:46] this caused a lot of trouble and basically the whole team went crazy figuring out this mess
[10:22:13] if one of those records gets bucketed in the wrong hour, it adds a sequence number that is not consistent with the other ones
[10:22:37] so our checks detect huge holes
[10:22:47] that are false alarms, because of mis-bucketing
[10:23:22] I seeee, gotcha
[10:24:04] since Marcel is a patient and great man, he worked around this issue creating smart alarms
[10:25:42] that are not counting the records with the dt:'-' field
[10:26:01] only the "incomplete records" metric counts them
[10:26:35] so we have two separate metrics to judge the data loss
[10:26:52] this is why when we received the email last week we freaked out a bit
[10:26:53] :D
[10:27:07] of course
[10:27:12] so one thing that I checked was https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1
[10:27:29] in which we have metrics about data loss between varnishkafka and kafka
[10:27:39] elukey: I'd love to see that morning writing copy-pasted onto a wiki page :)
[10:28:34] it is definitely a good candidate for our on-call checklist
[10:29:17] fdans: so kafka is fairly resilient to network issues or even a broker going down
[10:29:26] and as you can see there is no clear indication of data loss
[10:29:37] that is great since it is positive news :D
[10:29:56] (let me know when you are ready for the next part)
[10:30:31] what would indicate data loss here elukey ? these graphs are pretty rockandroll
[10:30:50] ahhaha sorry you are right
[10:31:24] so the first ones are basically the rate of enqueueing to kafka
[10:31:28] from the caching hosts
[10:31:46] unless you see a line that drops down to zero and stays there it means that data is flowing
[10:32:07] moreover at the end you have data send errors graphs
[10:32:10] (the bottomo ones)
[10:32:15] *bottom
[10:32:50] gotcha
[10:32:53] if you see spikes in there it means that either retransmission or data drop might have happened (iirc we retry three times before giving up, it is the default behavior of librdkafka)
[10:33:09] ah varnishkafka uses librdkafka to push data to kafka
[10:33:18] so if you hear it mentioned often this is why
[10:33:32] back to the data loss email
[10:34:09] Another great man enters the story now, Joseph :D
[10:34:18] yessss
[10:34:33] * fdans reels with excitement about what comes next
[10:34:48] he thought about a new "incarnation" of the above issue
[10:35:26] so misbucketed requests that successfully get a dt: field
[10:35:27] brb toilet!
[10:35:48] but that get misbucketed in the wrong hour because they take too long to complete
[10:37:34] aha
[10:39:26] so last time I came up with https://phabricator.wikimedia.org/P6203 to double check
[10:39:56] the first part of the output comes from a script that we have in refinery iirc, that tries to find the sequence holes
[10:40:55] (host nomenclature: ulsfo == SF datacenter, eqiad == Virginia, codfw == Dallas, esams == Amsterdam)
[10:41:40] yep
[10:44:11] so the output is nice since it shows two things
[10:44:37] some lines don't have either dt_before_missing or dt_after_missing
[10:45:19] but some of them do have both but still a ton of missing sequences (like 400k etc..)
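Aside - not the actual refinery script mentioned a few lines below, but a simplified sketch of how such holes, and the dt_before_missing / dt_after_missing values around them, can be found with a window function:

    SELECT
      hostname,
      prev_sequence + 1            AS first_missing_sequence,
      sequence - 1                 AS last_missing_sequence,
      sequence - prev_sequence - 1 AS missing_count,
      prev_dt                      AS dt_before_missing,
      dt                           AS dt_after_missing
    FROM (
      SELECT hostname, sequence, dt,
             LAG(sequence) OVER (PARTITION BY hostname ORDER BY sequence) AS prev_sequence,
             LAG(dt)       OVER (PARTITION BY hostname ORDER BY sequence) AS prev_dt
      FROM wmf_raw.webrequest
      WHERE webrequest_source = 'upload'
        AND year = 2017 AND month = 10 AND day = 30 AND hour = 0
    ) seqs
    WHERE sequence - prev_sequence > 1;

A hole only gets a dt on each side if the neighbouring records have one, which is why some lines in the output lack dt_before_missing or dt_after_missing.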
[10:45:42] so I picked one random missing sequence from those "holes" and checked if it was in the hour right after it
[10:45:48] and indeed it was :)
[10:45:53] yayy
[10:46:13] this was last week
[10:46:21] now we should check the same thing
[10:46:33] and possibly figure out a way to avoid all these false alarms
[10:47:09] the script that I used is in ./hive/webrequest/select_missing_sequence_runs.hql in refinery
[10:47:12] if you want to check it out
[10:47:32] the other queries are in the phab paste
[10:47:46] let's see
[10:48:02] entering stat1003
[10:49:42] elukey: Thinking of that: a nice way to prevent this issue would be to use start_timestamp as dt instead of end_timestamp in VK - Not sure how it would impact data though
[10:50:04] elukey: it's been a while since I last entered and it's asking me for a password...
[10:51:08] fdans: --verbose
[10:51:09] :D
[10:51:14] ahhh stat1003
[10:51:19] yeah it is decommed
[10:51:21] :P
[10:51:31] stat1004/5/6 are now the right ones
[10:51:35] 1004 would be ok
[10:51:50] aaaahhh that's right!
[10:51:58] joal: I think that when we were dealing with the dt:'-' we had a chat about using the start timestamp
[10:52:13] but then after a long brainstorming with Andrew there was an explanation about why it was not good
[10:52:48] elukey: I think so as well - the issue with using start was, IIRC, that we would consider as correct lines without end_timestamps
[10:53:06] maybe there was something else
[10:54:40] ah yes that was one of the issues
[10:58:36] elukey: another potential issue is that we decorrelate more-or-less strongly production time from event-date time, which is not super nice for potential streaming systems
[10:59:32] (PS1) GoranSMilovanovic: minor Shiny portal change [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/387212
[11:00:06] (CR) GoranSMilovanovic: [V: 2 C: 2] minor Shiny portal change [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/387212 (owner: GoranSMilovanovic)
[11:01:52] (PS1) GoranSMilovanovic: minor deletion [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/387213
[11:02:08] (CR) GoranSMilovanovic: [V: 2 C: 2] minor deletion [analytics/wmde/WDCM] - https://gerrit.wikimedia.org/r/387213 (owner: GoranSMilovanovic)
[11:03:36] elukey: dumb question, but the time to check is utc right?
[11:03:48] correct!
[11:03:54] all UTC, so much better
[11:04:16] * elukey remembers $other_workplace in which this was not true, what a horrible world
[11:07:32] hmmmm elukey I'm getting a java.io.EOFException: hdfs://analytics-hadoop/tmp/hive/fdans/ac20598d-5f23-49a4-990b-149dd150434c/hive_2017-10-30_11-04-15_401_908351781413837276-2/-mr-10004/b878675e-500c-4bdf-9188-890c01f7ea29/emptyFile not a SequenceFile
[11:08:09] https://www.irccloud.com/pastebin/ELzKKb5N/
[11:08:35] -d webrequest_source=bits
[11:08:47] that is a bit old :)
[11:08:55] now we only have upload/text/misc
[11:09:07] do I remove that?
[11:11:28] elukey: ^:)
[11:12:03] nono just use the correct one.. each email indicates a different segment
[11:32:25] fdans: going out for lunch but I'll brb in 30 mins :)
[11:34:46] * elukey lunch!
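Aside - the spot-check elukey describes above might look something like this; the hostname and sequence number are invented, the point is to search the hour right after the one that alarmed:

    SELECT hostname, sequence, dt
    FROM wmf_raw.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2017 AND month = 10 AND day = 30 AND hour = 1  -- hour after the alarmed one
      AND hostname = 'cp1065.eqiad.wmnet'  -- hypothetical caching host
      AND sequence = 123456789;            -- one of the "missing" sequences

If this returns a row, the request was never lost: it was just bucketed into the wrong hour, and the alarm is a false positive.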
[11:40:13] fdans: text/upload/misc (and before, bits and maps) are webrequest_source partition values, defining the webrequest_source partition and the folder names used to partition the data (in the format webrequest_source=text for instance)
[11:41:10] fdans: In today's alert email, it says: Data Loss Warning - Workflow webrequest-load-check_sequence_statistics-wf-upload-2017-10-30-0 <-- The upload part tells you about that :)
[11:48:44] yesss thank you joal after a bit of perusing I found that :)
[11:48:55] but your explanation is clearer
[12:03:26] I am back!
[12:03:43] fdans: sorry I forgot to explain that part :(
[12:04:04] nono np! :)
[12:07:54] elukey: quick lunch and I'll be back!
[12:09:52] sure!
[12:23:09] best cronspam of the day
[12:23:10] error: error renaming /var/log/refinery/sqoop-mediawiki.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.4 to /var/log/refinery/sqoop-mediawiki.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.
[12:23:17] 1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.5
[12:30:09] Analytics, EventBus, Services (doing): PHP put of memory error trying to log big events - https://phabricator.wikimedia.org/T179280#3719166 (Pchelolo)
[12:30:37] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3709298 (Pchelolo) @Ottomata pushing > 100 megs into log stash? Don't think that's a good idea. Actually digging a bit deeper, we're indeed trying to log these giant event...
[12:40:45] elukey hive is complaining when looking for the sequence
[12:40:59] https://www.irccloud.com/pastebin/xrkaXFy3/
[12:41:57] ah the script might be missing the serde then
[12:41:59] lemme check
[12:42:27] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#FAQ
[12:42:59] fdans: --^
[12:43:36] oooooo thanks!
[12:43:42] no, weird, it shows ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar; in it
[12:44:00] ahhh you are not using the script
[12:44:02] okok sorry
[12:44:12] I didn't read the whole paste
[12:44:16] okok then you'd need the ADD JAR
[12:44:19] nono I'm with the next step
[12:44:25] yeah :)
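Aside - the fix from the FAQ above in minimal form; the JAR path is the one quoted, the query itself is just a placeholder:

    ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
    -- the raw webrequest data is json behind a Hive serde, so without the
    -- jar above a query like this fails to deserialize the rows:
    SELECT hostname, sequence, dt
    FROM wmf_raw.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2017 AND month = 10 AND day = 30 AND hour = 0
    LIMIT 5;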
[12:45:22] elukey: hmmm, no rows for the sequence
[12:45:36] do you mind going into the cave for a second?
[12:46:57] 1 min, I am fixing some logs :(
[12:49:25] elukey@stat1005:/var/log/refinery$ ls -lh
[12:49:25] total 1.6G
[12:49:25] -rw-r--r-- 1 hdfs analytics-admins 122M Oct 30 12:01 eventlogging_refine_test.log
[12:49:28] -rw-r--r-- 1 hdfs analytics-admins 141M Oct 29 06:01 eventlogging_refine_test.log.1.1
[12:49:31] -rw-r--r-- 1 hdfs analytics-admins 194M Oct 27 06:01 eventlogging_refine_test.log.1.1.1.1
[12:49:34] -rw-r--r-- 1 hdfs analytics-admins 182M Oct 25 06:03 eventlogging_refine_test.log.1.1.1.1.1.1
[12:49:37] -rw-r--r-- 1 hdfs analytics-admins 193M Oct 23 06:01 eventlogging_refine_test.log.1.1.1.1.1.1.1.1
[12:49:40] -rw-r--r-- 1 hdfs analytics-admins 189M Oct 21 06:03 eventlogging_refine_test.log.1.1.1.1.1.1.1.1.1.1
[12:49:43] man ... That is wrong
[12:49:43] -rw-r--r-- 1 hdfs analytics-admins 192M Oct 19 06:01 eventlogging_refine_test.log.1.1.1.1.1.1.1.1.1.1.1.1
[12:49:46] -rw-r--r-- 1 hdfs analytics-admins 194M Oct 17 06:01 eventlogging_refine_test.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1
[12:49:49] -rw-r--r-- 1 hdfs analytics-admins 156M Oct 15 06:05 eventlogging_refine_test.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1
[12:49:52] this is art
[12:49:55] ahhahahah
[12:49:58] :d
[12:50:12] * joal tries to find a meaning
[12:50:37] * joal thinks the final number of '.1' will be 42
[12:53:34] ok fixed :D
[12:53:41] but I loved the output
[12:54:42] fdans: ok ready for the cave
[12:54:53] I'm in!
[12:58:42] Analytics, EventBus, Services (doing): PHP out of memory error trying to log big events - https://phabricator.wikimedia.org/T179280#3719262 (Pchelolo)
[13:08:05] joal: if you have a chance could you have a brief look at https://phabricator.wikimedia.org/T177257 ?
[13:08:16] I just CCed you!
[13:08:46] Hey addshore, will do !
[13:59:19] (PS2) Fdans: Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461)
[14:00:41] (CR) Fdans: [C: 2] [WIP] Clean up UI and aggregations [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[14:01:12] assuming it isn't a WIP anymore milimetric ?
[14:02:27] fdans: there's that todo in there and I have to run over Erik's initial list one more time. Nothing else needs code review though, so I can self merge the rest
[14:03:11] milimetric: aghh sorry then, I merged that change
[14:03:42] sok, fdans, I'll do another
[14:04:18] (CR) Fdans: Only query breakdowns when they are to be visualised (7 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[14:09:30] (PS3) Fdans: Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186 (https://phabricator.wikimedia.org/T178461)
[14:09:32] (PS3) Fdans: Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461)
[14:13:01] (PS4) Fdans: Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461)
[14:18:38] Analytics, Operations, ops-eqiad, User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3719606 (Cmjohnson) a: RobH this server is out of warranty by 6 months. Assigning to @robh to determine if we should order a new one?
[14:19:26] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3719610 (Cmjohnson) a: RobH this server is out of warranty by 6 months. Assigning to @robh to determine if we should order a new one...probably two.
[14:26:31] fdans: what do you think about subtle year markers on the widget line graphs?
[14:27:07] maybe like along the bottom just a little tick mark and the year?
[14:27:32] I think it would make the "year over year" thing much more useful
[14:30:03] Yeah agreed milimetric
[14:31:23] fdans: also, just reviewed the breakdowns change, looks almost ready to go. One minor thing.
In the style, if you have multiple breakdowns, there's more white space between the checkbox list and the breakdown button than between the button and the next breakdown. That makes the button look like it belongs with the wrong list
[14:31:54] maybe switch the whitespace and add one of those subtle lines under it
[14:32:03] ah yes, forgot about that
[14:32:10] lemme do and push that now
[14:32:26] fdans: actually wait
[14:32:28] forget it
[14:32:39] let's merge and I'll do all this on the pass-over I'm doing this week
[14:32:51] because there are other little things like cursor styles and stuff
[14:33:46] cool
[14:34:19] * fdans makes a coffee before pushing
[14:45:55] taking a break a-team
[14:46:32] milimetric: o/
[14:49:24] hi elukey
[14:51:20] a-team - I just realized that the time change in europe this week has moved standup an hour earlier - Meaning I'll miss it today
[14:51:24] I'm very sorry for that
[14:51:30] it's fine joal
[14:51:48] daylight savings is the worst
[14:52:06] Today I continued looking into data quality issues, no huge findings yet
[14:52:11] later !
[14:52:47] milimetric: qq for you if you have time - atm I am investigating an issue in the job queues
[14:52:57] commons keeps inserting jobs like https://www.mediawiki.org/wiki/Manual:RefreshLinks.php
[14:53:34] now I am trying to track down what is causing it, that might have been a huge template change or a bot
[14:53:39] or whatever else
[14:54:13] do you have any idea / suggestion about how to verify, say for a given page in commons, if/when a refreshlink happened and what it changed?
[14:57:55] this is what it does https://github.com/wikimedia/mediawiki/blob/master/includes/jobqueue/jobs/RefreshLinksJob.php
[14:58:01] sorry, not getting irc pings
[14:58:07] * milimetric thinking
[15:02:42] mforns: joal standup?
[15:02:50] no ignore jo al
[15:02:54] ottomata, oh!
[15:13:46] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3719839 (pmiazga) @mforns - We're missing one very important bit. You're showing us results of `user_properties` table query...
[15:27:52] o/ ottomata
[15:27:52] https://www.mediawiki.org/w/index.php?title=Topic:Tzvxe3sckmfoftp8&topic_showPostId=u0zq14ty14khjfj9#flow-post-u0zq14ty14khjfj9
[15:28:09] * halfak draws pictures of his "state validator" idea for a single-source system.
[15:31:57] (PS4) Fdans: Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186 (https://phabricator.wikimedia.org/T178461)
[15:36:38] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3709298 (mobrovac) Obviously, we need to get rid of such huge jobs, so one temporary solution that ought to be safe enough is to use the EventBus service to log to a file a...
[15:36:55] Analytics-Kanban: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3719923 (Ottomata)
[15:38:33] Analytics-Kanban, Pageviews-API: Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3719925 (Ottomata) p: Triage>Normal a: Milimetric
[15:44:23] (CR) Milimetric: [C: 2] Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[15:44:51] (CR) Milimetric: [C: 2] Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[15:45:20] (CR) Milimetric: [V: 2 C: 2] Use only js Date objects internally [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385186 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[15:45:35] (CR) Milimetric: [V: 2 C: 2] Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[15:45:47] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3719952 (Ottomata) > Obviously, we need to get rid of such huge jobs, so one temporary solution that ought to be safe enough is to use the EventBus service to log to a file...
[15:46:54] oh mforns you got a quick second for a naming brain bounce for cergen stuff?
[15:47:04] ottomata, sure!
[15:47:06] bc
[15:47:23] oh it's taken!
[15:47:25] -2
[15:47:28] ok
[15:50:59] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3719988 (Pchelolo) > logging this from MediaWiki (even if just the meta) is more appropriate. Let's see where it gets us when my [[ https://gerrit.wikimedia.org/r/#/c/387...
[15:59:29] (PS5) Fdans: Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461)
[15:59:36] (CR) jerkins-bot: [V: -1] Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[16:07:27] elukey, can you review and merge this EL white-list patch please?? I +1'd it
[16:08:21] mforns: will do after ops meeting!
[16:09:18] elukey, thaaaanks! :]
[16:27:18] Hey milimetric
[16:27:52] what's up joal
[16:28:08] milimetric: I could go with some brain juice if you have time
[16:28:25] ok, one minute and I'll be in cave
[16:28:30] Thanks a lot
[16:29:08] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3720108 (RobH) a: RobH>Cmjohnson @Cmjohnson: Do we have any power supplies on already decommissioned hardware that would fit in the system with the failed powe...
[16:37:13] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3720160 (mforns) @pmiazga Oh, understand! I didn't know that. Thanks for the explanation. Then yes, you're totally right, `s...
[16:39:25] milimetric: https://www.mediawiki.org/wiki/Manual:RefreshLinks.php contains a better explanation of what they do.. I thought it was related to wikitext but it is not
[16:46:53] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3720185 (Ottomata) > We don't actually need any additional logic Oh hm, that is probably true. How often do these events happen? I don't want to fill up local disks wit...
[17:35:28] (Abandoned) Fdans: Only query breakdowns when they are to be visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/385954 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[17:38:30] * elukey off!
[17:58:20] milimetric: do you have a moment to cave?
[17:58:33] sure! :)
[17:58:43] there fdans
[18:24:25] (PS1) Milimetric: Query breakdowns only when they are visualised [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387320 (https://phabricator.wikimedia.org/T178461)
[18:24:56] (CR) Milimetric: [V: 2 C: 2] "fdans actually did this, I just stole credit" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387320 (https://phabricator.wikimedia.org/T178461) (owner: Milimetric)
[18:32:43] milimetric: I have an EXPLANATION !
[18:33:00] * joal feels the urge to shout some - SORRY TEAM
[18:36:07] Analytics, EventBus, Patch-For-Review, Services (doing): PHP out of memory error trying to log big events - https://phabricator.wikimedia.org/T179280#3720563 (mobrovac) p: Triage>High
[19:33:02] Analytics-Tech-community-metrics, Developer-Relations: Investigate listing the "Onboarding New Developers" KPIs on a custom dashboard - https://phabricator.wikimedia.org/T179329#3720791 (Aklapper)
[19:34:38] Analytics-Tech-community-metrics, Developer-Relations: Find a way to publish DB index names, to allow anyone to construct more complex queries based on certain indeces - https://phabricator.wikimedia.org/T179330#3720800 (Aklapper)
[21:03:40] milimetric: Have a minute before I go to bed to listen to my findings?