[00:09:35] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10ayounsi) Thanks for looking into that!  About TCP flags, if you filter with `IP proto: tcp` there should not be any null. It's expected that other IP...
[01:36:34] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10Nuria) @ayounsi if you can provide a map like:                     map: {"32":"URG", "16":"ACK","8":"PSH", "4":"RST", "2":"SYN", "1":"FIN"}   we can do th...
[02:17:01] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10ayounsi) Adding the following to the ones you listed should cover most of the cases: > {'49': 'ACK+URG+FIN', '48': 'ACK+URG', '40': 'PSH+URG', '36': 'RST+...
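The labels above are combinations of the six flag bits in Nuria's map, so they can be generated instead of typed by hand. A minimal Python sketch, assuming the flags dimension holds the plain integer bitmask. `tcp_flags_label`, `build_flags_map` and the `"NONE"` label for 0 are hypothetical names. It joins names highest bit first, so 49 comes out as `URG+ACK+FIN` rather than the `ACK+URG+FIN` spelling used in the thread:

```python
from typing import Dict, Iterable

# Bit values of the six classic TCP flags, taken from the map quoted above.
TCP_FLAGS = {32: "URG", 16: "ACK", 8: "PSH", 4: "RST", 2: "SYN", 1: "FIN"}


def tcp_flags_label(value: int) -> str:
    """Return a '+'-joined label for a 6-bit TCP flags value, highest bit first."""
    if not 0 <= value < 64:
        raise ValueError(f"expected a 6-bit flags value, got {value}")
    names = [name for bit, name in sorted(TCP_FLAGS.items(), reverse=True) if value & bit]
    return "+".join(names) if names else "NONE"  # "NONE" is a made-up label for 0


def build_flags_map(values: Iterable[int]) -> Dict[str, str]:
    """Map each observed integer (as a string key, like the Turnilo map) to its label."""
    return {str(v): tcp_flags_label(v) for v in values}
```

For example, `build_flags_map([2, 18])` gives `{"2": "SYN", "18": "ACK+SYN"}`.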
[06:24:01] <wikibugs>	 10Analytics-Kanban: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (10elukey) From upstream, it seems that the fixed version of Hue (4.3) will only be available in CDH6.
[06:27:48] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey)
[06:27:57] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) a:03elukey
[06:28:22] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-Elukey: Create a spicerack recipe to reboot the hadoop worker nodes - https://phabricator.wikimedia.org/T225297 (10elukey)
[06:34:26] <elukey>	 morningggg
[06:34:36] <elukey>	 the script to roll reboot hadoop seems to be working fine
[06:35:06] <elukey>	 it reboots in batches (non-hdfs-journal-nodes first) and if one reboot fails, it asks for confirmation before proceeding
[06:35:15] <elukey>	 then it reboots journal node hosts one at a time
[06:35:23] <elukey>	 (this is only for workers)
[06:35:48] <elukey>	 for every batch, it
[06:35:54] <elukey>	 1) stops yarn and waits 10 minutes
[06:35:58] <elukey>	 2) stops the hdfs datanode
[06:36:03] <elukey>	 3) if the host is a journal node, stops the journalnode
[06:36:06] <elukey>	 4) reboot
[06:36:23] <elukey>	 for every *host* in the batch, it
[06:36:46] <elukey>	 I think it is finally ready now
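The control flow described above can be sketched like this. It is not the actual spicerack cookbook: `roll_reboot`, `reboot_batch`, the action strings and the injected `run`/`confirm`/`sleep` callables are hypothetical stand-ins, so the ordering (regular workers in batches, journal nodes one at a time, operator confirmation after a failed reboot) can be exercised without touching real hosts.

```python
import time
from typing import Callable, Iterator, List, Sequence, Set

YARN_DRAIN_SECONDS = 10 * 60  # "stops yarn and waits 10 minutes"

# run(host, action) performs one step; it raises RuntimeError on failure (assumption).
Run = Callable[[str, str], None]


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split hosts into consecutive batches of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def reboot_batch(batch: List[str], journalnodes: Set[str], run: Run,
                 sleep: Callable[[float], None]) -> List[str]:
    """Run the per-batch steps and return the hosts whose reboot failed."""
    for host in batch:
        run(host, "stop yarn nodemanager")  # a failure here propagates and aborts
    sleep(YARN_DRAIN_SECONDS)  # give running YARN containers time to finish
    failed = []
    for host in batch:
        try:
            run(host, "stop hdfs datanode")
            if host in journalnodes:
                run(host, "stop hdfs journalnode")
            run(host, "reboot")
        except RuntimeError:
            failed.append(host)
    return failed


def roll_reboot(workers: Sequence[str], journalnodes: Set[str], batch_size: int,
                run: Run, confirm: Callable[[str], bool],
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Reboot non-journal workers in batches, then journal nodes one at a time.

    Returns False if the operator declines to continue after a failure.
    """
    regular = [h for h in workers if h not in journalnodes]
    plan = list(batches(regular, batch_size))
    plan += [[h] for h in workers if h in journalnodes]
    for batch in plan:
        failed = reboot_batch(batch, journalnodes, run, sleep)
        if failed and not confirm(f"Reboot failed on {', '.join(failed)}. Continue?"):
            return False
    return True
```

Injecting `run` and `sleep` keeps the ordering logic testable with a recording function instead of SSH calls and a real 10-minute wait.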
[07:03:42] <elukey>	 https://www.datanami.com/2019/01/10/cloudera-unveils-cdp-talks-up-enterprise-data-cloud/
[07:03:48] <elukey>	 sigh
[07:03:55] <elukey>	 that complicates our plans for CDH6
[07:04:16] <elukey>	 or rather, it might mean upgrading to 6 and then to something else soon
[07:30:28] <wikibugs>	 (03CR) 10Fdans: [V: 03+2 C: 03+2] Add fake test data for mediarequests per file [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/536556 (owner: 10Fdans)
[07:30:50] <elukey>	 look at fdans, who starts working without even saying hello
[07:31:46] <fdans>	 OH SORRY LUCA HOW ARE YOU HOPE YOU'RE HAVING A MOST MAGICAL DAY
[07:32:01] <elukey>	 hahahahahaah
[07:32:24] <elukey>	 a "good morning" would have been enough Francisco but I appreciate the effort :P
[07:33:23] <fdans>	 elukey: I saw your question on the hue repo yesterday
[07:33:45] <fdans>	 you could send a PR titled "START OVER" that just deletes the entire project
[07:36:02] <elukey>	 ahahahhah
[07:36:11] <elukey>	 yes a really polite enquiry
[07:56:48] <wikibugs>	 10Analytics, 10Operations, 10User-Elukey: setup/install eqiad kerbos node WMF5173 - https://phabricator.wikimedia.org/T233141 (10elukey) a:05elukey→03RobH Thanks a lot!  Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough
[07:57:19] <wikibugs>	 10Analytics, 10Operations, 10User-Elukey: setup/install codfw kerbos node WMF6577 - https://phabricator.wikimedia.org/T233142 (10elukey) a:05elukey→03RobH Thanks a lot!  Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough
[08:27:10] <elukey>	 just sent an email to the team about the recent hadoop discoveries
[08:27:19] <elukey>	 I see a lot of joy for us in the future
[08:27:20] <elukey>	 sigh
[08:47:05] <wikibugs>	 (03PS3) 10Fdans: Add aggregate mediarequests per referer endpoint [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857)
[08:47:40] <wikibugs>	 (03CR) 10Fdans: [V: 03+1] "@joal this is ready to merge before deploying aqs if ok with you" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans)
[08:54:34] <wikibugs>	 (03PS1) 10Fdans: Add mediarequests per referer to fake data script [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537601 (https://phabricator.wikimedia.org/T232857)
[09:39:00] <wikibugs>	 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10elukey) @ayounsi should we keep this task open?
[10:09:48] * elukey errand + lunch!
[10:21:00] <wikibugs>	 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Resolved→03Open Thanks, I can ssh into production servers. However, I cannot access SWAP following [[ https://wikitech.wikimedia.org/wiki/...
[12:02:08] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "The test job launched by @mforns succeeded. Merging!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[12:04:49] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537601 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans)
[12:05:37] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537114 (https://phabricator.wikimedia.org/T232857) (owner: 10Fdans)
[12:06:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add spark job to create mediawiki history dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[12:06:37] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530002 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns)
[12:06:43] <mforns>	 thanks joal!!
[12:07:05] <joal>	 no prob mforns :)
[12:07:18] <joal>	 Preparing this evening's full deploy - will release new jar now
[12:09:48] <wikibugs>	 (03PS1) 10Joal: Bump changelog to v0.0.100 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632
[12:09:57] <joal>	 mforns: --^ please if you have a minute :)
[12:12:55] <wikibugs>	 (03PS1) 10Joal: Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854)
[12:13:17] <joal>	 This one could be double-checked as well mforns :)
[12:15:39] <elukey>	 joal: o/
[12:15:44] <joal>	 Hi elukey :)
[12:15:47] <joal>	 Siesta time ;)
[12:15:48] <elukey>	 did you see my lovely link about cloudera?
[12:15:54] <elukey>	 siestaaaaa
[12:15:55] <joal>	 I did
[12:16:17] <joal>	 elukey: will bigtop become a less than second class alternative?
[12:16:38] <joal>	 elukey: I have info for you about notebookerbs
[12:16:40] <elukey>	 no idea, it depends on a lot of things, but it is a mess :(
[12:16:49] <joal>	 it is indeed :(
[12:20:18] <wikibugs>	 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10elukey) @MGerlach done! Please take a moment to review https://wikitech.wikimedia.org/wiki/LDAP/Groups#wmf_group, since your account is now able to see a...
[12:23:52] <joal>	 looks like mforns is gone - elukey would you mind? https://gerrit.wikimedia.org/r/537632
[12:24:54] <joal>	 elukey: notebookerbs work great with a kinit, except for hive access (I was kinda expecting that)
[12:26:31] <joal>	 oh actually elukey - my superbad
[12:26:32] <elukey>	 joal: ah yes we'll probably need to fix that
[12:26:39] <elukey>	 oh no mmm
[12:26:46] <elukey>	 in theory if it doesn't use JDBC it should work
[12:26:47] <joal>	 elukey: I think the problem was PEBCAK
[12:26:47] <elukey>	 in theory
[12:27:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "100!! We should celebrate!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632 (owner: 10Joal)
[12:27:36] <joal>	 Maybe at some point we'll go to 0.1.0 :)
[12:27:54] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/537632 (owner: 10Joal)
[12:39:11] <joal>	 !log Release refinery-source v0.0.100 to archiva
[12:39:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:49:46] <wikibugs>	 (03PS2) 10Joal: Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854)
[12:50:17] <joal>	 elukey: if you don't mind please :)
[12:50:21] <joal>	 --^
[12:50:33] <elukey>	 all the failures in alerts@ are you joal?
[12:50:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Bump webrequest-load jar version to v0.0.100 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) (owner: 10Joal)
[12:51:03] <joal>	 elukey: not supposed to !
[12:51:09] <elukey>	 also I have stuff to review/merge before deploying refinery :)
[12:51:12] <joal>	 elukey: the jenkins ones yes, the other ones not
[12:51:21] <joal>	 elukey: ack !!!
[12:53:20] <elukey>	 mmmm seems a problem while contacting the hive 2 server
[12:53:31] <joal>	 yes
[12:53:33] <elukey>	 joal: ok for me to restart the webrequest job?
[12:53:38] <joal>	 please
[12:53:43] <joal>	 I was looking at those as well
[12:54:39] <elukey>	 !log re-run webrequest-load upload/text for hour 11 due to transient hive server socket failures
[12:54:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:54:49] <joal>	 elukey: this is bizarre :(
[12:55:26] <joal>	 elukey: hole in metrics as well at the time
[12:55:42] <joal>	 no bueno
[12:56:19] <joal>	 elukey: about notebookerbs, I take back what I said about having problems - all seems to be working great :)
[12:56:24] <joal>	 Sorry for bothering :)
[12:56:31] <joal>	 Will update the page
[12:56:42] <elukey>	 I am super happy, thanks for the tests!
[12:57:18] <elukey>	 ● hive-server2.service - LSB: Hive Server2
[12:57:18] <elukey>	    Loaded: loaded (/etc/init.d/hive-server2; generated; vendor preset: enabled)
[12:57:21] <elukey>	    Active: active (exited) since Wed 2019-09-18 12:38:09 UTC; 18min ago
[12:57:24] <elukey>	 18min ago
[12:57:32] <joal>	 Actually one thing is not great: no meaningful error message when failing to start a kernel due to kerb errors
[12:58:21] <elukey>	 metastore also restarted
[12:58:28] <joal>	 Thanks elukey 
[12:58:32] <joal>	 I really wonder :(
[12:58:53] <elukey>	 weird
[12:59:27] <joal>	 Wow - weird error message on wikitech
[12:59:42] <elukey>	 wikitech?
[13:25:49] <joal>	 elukey: pinged on ops chan, and created a task (https://phabricator.wikimedia.org/T233215)
[13:26:14] <joal>	 elukey: I updated your page: 
[13:26:19] <joal>	 https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster
[13:26:40] <joal>	 at the end - some things could/should be better I guess - Particularly in reporting errors
[13:26:57] <joal>	 elukey: let me know when ready for merge/deploy on refinery
[13:27:47] <elukey>	 so the code is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/537255/
[13:27:53] <elukey>	 it needs marcel's +1 
[13:27:57] <elukey>	 mforns: you there? :)
[13:28:07] <joal>	 ah right elukey - forgot that
[13:28:54] <elukey>	 is it blocking you?
[13:28:58] <joal>	 elukey: any idea of what caused the hive-server hiccup?
[13:29:13] <elukey>	 not really, didn't find anything in the logs
[13:29:17] <joal>	 elukey: I was planning to move on with deploying, so that only aqs is left for later
[13:29:49] <joal>	 elukey: upload succeeded - I'm assuming text will as well
[13:31:11] <elukey>	 the host's metrics were not showing any spike in cpu or memory
[13:31:51] <elukey>	 no oom killer
[13:31:59] <elukey>	 hive logs are not very indicative
[13:32:23] <elukey>	 and both server2 and metastore down
[13:32:24] <elukey>	 mmmm
[13:33:02] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537633 (https://phabricator.wikimedia.org/T212854) (owner: 10Joal)
[13:33:31] <wikibugs>	 (03PS2) 10Elukey: Force execution of all the (python) scripts under bin/ with python3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537255 (https://phabricator.wikimedia.org/T204735)
[13:33:39] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Force execution of all the (python) scripts under bin/ with python3 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537255 (https://phabricator.wikimedia.org/T204735) (owner: 10Elukey)
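A rough idea of what forcing python3 on every script under `bin/` involves, assuming the scripts pick their interpreter through a shebang line. This is a hypothetical sketch, not the content of change 537255; `force_python3` and the regex are illustrative. Reverting one script by hand means putting the old `#!/usr/bin/env python` line back.

```python
import re
from pathlib import Path
from typing import List

# Matches "#!/usr/bin/env python", "#!/usr/bin/python", and python2 / python2.7
# variants, but not shebangs that already say python3.
PY2_SHEBANG = re.compile(r"^#!(?:/usr/bin/env\s+|/usr/bin/)python(?:2(?:\.\d+)?)?\s*$")


def force_python3(bin_dir: Path) -> List[str]:
    """Rewrite python/python2 shebangs in bin_dir to python3; return changed file names."""
    changed = []
    for path in sorted(p for p in bin_dir.iterdir() if p.is_file()):
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binaries and non-UTF-8 files
        first, sep, rest = text.partition("\n")
        if PY2_SHEBANG.match(first):
            # Writing to the existing file keeps its executable bit.
            path.write_text("#!/usr/bin/env python3" + sep + rest, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Changing the shebang only switches the interpreter; as the rest of this log shows, each script still has to be exercised under Python 3 to catch semantic differences.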
[13:33:51] <elukey>	 joal: worst that can happen is that we'll restore one script
[13:33:54] <elukey>	 please go ahead :)
[13:34:02] <joal>	 ack !!
[13:35:53] <joal>	 !log Deploying refinery using scap
[13:35:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:26] <joal>	 ottomata: hello?
[13:41:43] <wikibugs>	 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Open→03Resolved @elukey thanks, works now. Closing this task.
[13:41:58] <joal>	 !log Deploy refinery to hdfs
[13:42:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:44:56] <joal>	 Gone back to kids
[13:46:20] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Thanks to the awesome work of @Jclark-ctr an-presto1001 and an-presto1003 are now reimaged, but an-p...
[13:47:59] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Proposed fix for asw2-b:  ` delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5 s...
[13:48:49] <mforns>	 hey elukey I'm back
[13:48:59] <elukey>	 hola!
[13:49:18] <elukey>	 I merged https://gerrit.wikimedia.org/r/537255 to unblock joal, buuut if you could triple check I'd be grateful :)
[13:49:25] <mforns>	 ok
[13:49:51] <elukey>	 <3
[13:50:44] <elukey>	 4/5 presto nodes ready
[13:50:49] <elukey>	 last one standing :D
[13:52:17] <mforns>	 elukey, looks good to me!
[13:52:39] <elukey>	 super :)
[13:55:49] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10akosiaris) >>! In T225128#5503053, @elukey wrote: > Proposed fix for asw2-b: >  > ` > delete interfaces inte...
[14:02:56] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Committed:  ` elukey@asw2-b-eqiad# show | compare [edit interfaces interface-range vlan-cloud-hosts1...
[14:07:42] <elukey>	 last presto node reimaging \o/
[14:21:32] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata) Just ran ` apt-get purge python-zmq python-tornado python-ua-parser python-urllib3 py...
[14:22:12] <ottomata>	 yay!
[14:22:22] <ottomata>	 mgerlach:  o/
[14:22:24] <ottomata>	 how goes?!
[14:26:27] <elukey>	 all presto nodes with buster
[14:41:40] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Ok so current status:  * All hosts reimaged to buster and working * Renamed hostnames in netbox * Wai...
[14:46:20] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) AWESOME thank youuuu
[14:50:24] <mgerlach>	 ottomata: o/ all good. thanks for your reply on the venv. 
[14:51:57] <mgerlach>	 easier to start with the mediawiki-utilities but you are right that performance-wise I should look into the hive table
[14:52:44] <ottomata>	 mgerlach: well, a nice thing about not using mwxml is...you don't need the xml, and you don't need to parse any xml
[14:53:31] <mgerlach>	 ottomata: indeed, that's a big plus (although the tool takes away almost all the xml-awfulness)
[14:56:06] <ottomata>	 mgerlach:  https://wikitech.wikimedia.org/wiki/SWAP#sql_magic
[14:56:42] <ottomata>	 haha although i just tried and got a big error
[14:56:44] <ottomata>	 with sql_magic....
[14:57:26] <ottomata>	 running `!pip install sql_magic` in the notebook first helped
[14:58:17] <mgerlach>	 ottomata: do you recommend running via notebooks?
[14:58:27] <ottomata>	 depends on what you are doing
[14:58:58] <ottomata>	 if you like jupyter notebooks go ahead...but they aren't the easiest things for us to maintain, so they often have problems...
[14:59:49] <ottomata>	 you can do the same stuff from cli too (although not sql magic? not sure)
[15:00:28] <mgerlach>	 perfect. i will check it out (just got the access a few minutes ago ; )
[15:07:28] <ottomata>	 mgerlach: i changed the example at https://wikitech.wikimedia.org/wiki/SWAP#with_Hive_(MapReduce): to use mediawiki_wikitext_history
[15:08:22] <ottomata>	 the main thing to be careful about with notebooks is to not store big data on the local hard drives
[15:08:27] <ottomata>	 they aren't huge, and that is what HDFS is for
[15:08:49] <mgerlach>	 ottomata: ok, got it. thanks
[15:09:32] <ottomata>	 oo, also, an important thing about any hive table
[15:09:32] <ottomata>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Use_partitions
[15:09:47] <ottomata>	 not all tables have the same partitions fields
[15:09:52] <ottomata>	 but if you show create table or describe the table
[15:09:59] <ottomata>	 it will tell you which ones are partition fields
[15:10:04] <ottomata>	 you can also do
[15:10:11] <ottomata>	 show partitions <table> to see what is available
[15:10:23] <ottomata>	 although for some tables there will be a LOT of partitions
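Once `show partitions <table>` has printed its `key=value/key=value` rows, they can be turned into a partition predicate for the `WHERE` clause. A minimal sketch; the function names and the sample rows (with `snapshot` and `wiki_db` as partition keys) are made up for illustration, so check the real keys with `describe` first. It assumes partition values contain no characters that Hive escapes in partition names.

```python
from typing import Dict, List


def parse_partition(row: str) -> Dict[str, str]:
    """Turn a SHOW PARTITIONS row like 'snapshot=2019-08/wiki_db=enwiki' into a dict."""
    return dict(part.split("=", 1) for part in row.strip().split("/"))


def latest_value(rows: List[str], key: str) -> str:
    """Largest value of one partition key; lexicographic, so fine for zero-padded YYYY-MM."""
    return max(parse_partition(row)[key] for row in rows)


def partition_filter(spec: Dict[str, str]) -> str:
    """WHERE-clause fragment pinning a query to one partition.

    Every value is quoted as a string literal, which suits string-typed partition keys.
    """
    return " AND ".join(f"{key} = '{value}'" for key, value in spec.items())
```

With hypothetical rows `["snapshot=2019-07/wiki_db=enwiki", "snapshot=2019-08/wiki_db=enwiki"]`, `latest_value(rows, "snapshot")` is `"2019-08"`, and `partition_filter({"snapshot": "2019-08", "wiki_db": "enwiki"})` yields `snapshot = '2019-08' AND wiki_db = 'enwiki'`.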
[15:33:39] <elukey>	 ottomata: ops sync?
[15:33:57] <ottomata>	 oh ho
[15:33:58] <ottomata>	 yes
[15:36:18] <icinga-wm>	 PROBLEM - Check the last execution of refine_eventlogging_eventbus_job_queue on an-coord1001 is CRITICAL: NRPE: Command check_check_refine_eventlogging_eventbus_job_queue_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:55:54] <elukey>	 ah this is due to the eventbus cleanup
[15:55:57] <elukey>	 np
[15:57:40] <ottomata>	 i think that is a laggy icinga
[15:57:41] <wikibugs>	 10Analytics: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (10elukey)
[15:57:43] <ottomata>	 running puppet on icinga
[16:01:04] <nuria>	 ping joal 
[16:01:11] <wikibugs>	 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10ayounsi) 05Open→03Resolved All good here. Thanks!
[16:18:00] <icinga-wm>	 PROBLEM - Check the last execution of refinery-drop-webrequest-raw-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:18:37] <joal>	 elukey: this feels like a p3 related error --^
[16:19:06] <elukey>	 buuuuu
[16:19:07] <elukey>	 :)
[16:19:09] <elukey>	 checking
[16:19:17] <joal>	 Thanks :)
[16:25:54] <icinga-wm>	 PROBLEM - Check the last execution of refinery-drop-apiaction-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-apiaction-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:26:26] <icinga-wm>	 PROBLEM - Check the last execution of refinery-drop-eventlogging-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:26:50] <elukey>	 yes yes bad Luca
[16:26:56] <elukey>	 uffff
[16:27:22] <elukey>	 set some downtime
[16:28:07] <joal>	 what happens elukey ?
[16:28:14] <joal>	 deleted wrong files?
[16:28:23] <elukey>	 nono the script doesn't even run
[16:28:36] <elukey>	 I think it is a docopt issue
[16:29:43] <joal>	 Mwarf :(
[16:31:41] <wikibugs>	 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria)
[16:31:52] <wikibugs>	 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria) p:05Triage→03Normal
[16:32:46] <icinga-wm>	 PROBLEM - Check the last execution of refinery-drop-cirrussearchrequestset-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-cirrussearchrequestset-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:33:40] <wikibugs>	 10Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (10Nuria) We can, load this data into a "shadow" table, v2 in cassandra and after swap the current table by the other one.
[16:35:40] <icinga-wm>	 PROBLEM - Check the last execution of refinery-drop-eventlogging-client-side-partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-client-side-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:36:08] <joal>	 elukey: should I rollback the change?
[16:36:56] <elukey>	 joal: I am still testing, the rollback can be a manual oneliner on an-coord1001, will do it if I can't find the issue
[16:37:07] <joal>	 k elukey 
[16:37:08] <elukey>	 first was that python3-mock wasn't on an-coord1001
[16:37:11] <elukey>	 just installed it
[16:39:39] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata) Phew ok after all those patches, I think we are good with puppet code cleanup!
[16:40:34] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 4 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata)
[16:44:02] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) So it seems that the `refinery-drop-older-than` script has some issues with python3:  1) python3-mock was not installed, but no errors were shown due to `sys.stderr = open(os.devnull,...
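The stderr redirection mentioned in this note is what hid the missing `python3-mock` import: once `sys.stderr` points at `/dev/null` for the whole script, every later traceback vanishes too. A sketch of a narrower alternative, assuming the goal was only to silence a noisy call; `run_quietly`, `chatty` and `needs_missing_module` are hypothetical names, not refinery code.

```python
import contextlib
import os
import sys

# Pattern from the bug note (not executed here): a module-level
#   sys.stderr = open(os.devnull, 'w')
# silences everything that follows, including ImportError tracebacks.


def chatty() -> None:
    """Stand-in for a call that writes noise to stderr."""
    print("some noisy warning", file=sys.stderr)


def needs_missing_module() -> None:
    """Stand-in for code whose dependency (e.g. mock) is not installed."""
    import module_that_is_not_installed  # noqa: F401  (hypothetical name)


def run_quietly(func, *args, **kwargs):
    """Send stderr to /dev/null only while `func` runs.

    redirect_stderr restores the previous sys.stderr when the block exits,
    including when `func` raises, so the traceback printed afterwards still
    reaches the real stderr.
    """
    with open(os.devnull, "w") as devnull, contextlib.redirect_stderr(devnull):
        return func(*args, **kwargs)
```

With this shape, `run_quietly(needs_missing_module)` still raises `ImportError`, and the traceback is visible because stderr has been restored by the time Python prints it.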
[16:44:32] <icinga-wm>	 PROBLEM - Check the last execution of camus-eventbus on an-coord1001 is CRITICAL: NRPE: Command check_check_camus-eventbus_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:45:01] <elukey>	 !log manually set "#!/usr/bin/env python" for refinery-drop-older-than on an-coord1001 to restore functionality (minor bug encountered)
[16:45:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:46:00] <elukey>	 !log manually restarted the refinery-drop-older-than jobs
[16:46:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:46:12] <icinga-wm>	 RECOVERY - Check the last execution of refinery-drop-eventlogging-client-side-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-client-side-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:46:37] <elukey>	 joal: manually restored python2, everything works now.. will work with mforns on a fix, but I'd like to leave the rest running to see if we encounter more issues
[16:46:40] <elukey>	 if you are ok
[16:47:00] <mforns>	 elukey, can I help?
[16:47:09] <joal>	 no problem for me - there might be some alarms when the new day starts
[16:47:43] <elukey>	 mforns: o/ - it is my bad, I was too confident with python3 :) - I am getting https://phabricator.wikimedia.org/T204735#5503751
[16:48:01] <elukey>	 it is probably a quick change, but I guess that the script needs to be tested more with python 3
[16:49:02] <icinga-wm>	 RECOVERY - Check the last execution of refinery-drop-webrequest-raw-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-webrequest-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:49:28] <mforns>	 elukey, ok, will change now
[16:50:54] <elukey>	 mforns: nono I already rolled back manually, whenever you have time :)
[16:51:00] <mforns>	 oh ok
[16:51:04] <elukey>	 this bit hash_message = bytes(sorted(hash_args.items()))
[16:51:06] <elukey>	 leads to
[16:51:13] <elukey>	 TypeError: 'tuple' object cannot be interpreted as an integer
[16:51:19] <mforns>	 yea... weird
[16:51:33] <mforns>	 I don't see any tuple
[16:53:25] <icinga-wm>	 RECOVERY - Check the last execution of refinery-drop-cirrussearchrequestset-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-cirrussearchrequestset-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:54:04] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) ` >>> hash_args = {'a': 'a', 'b': 'b'} >>> hash_message = bytes(sorted(hash_args.items())) Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: 'tuple'...
[16:54:07] <elukey>	 mforns: --^
[16:54:41] <mforns>	 elukey, yea, items() generates a list of tuples
[16:56:12] <elukey>	 so it seems that applying bytes() to [('a', 'a'), ('b', 'b')] differs in python3
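The difference is in `bytes` itself: in Python 2 it is an alias of `str`, so `bytes(x)` is just `str(x)`, the repr of the list; in Python 3 `bytes(iterable)` expects integers in `range(256)`. A minimal sketch of a Python 3 expression that reproduces the Python 2 message, assuming ASCII-only `str` keys and values. This is not necessarily the change mforns made, and sha256 stands in for whatever hash the script actually uses.

```python
import hashlib

# Python 2: bytes(sorted({'a': 'a', 'b': 'b'}.items())) == "[('a', 'a'), ('b', 'b')]"
# Python 3: the same call raises
#   TypeError: 'tuple' object cannot be interpreted as an integer


def hash_message(hash_args: dict) -> bytes:
    """Python 3 equivalent of Python 2's bytes(sorted(hash_args.items())).

    Byte-identical to the Python 2 result only for plain ASCII str keys and
    values: Python 2 reprs unicode objects as u'...' and escapes non-ASCII
    bytes, so those inputs would produce a different message.
    """
    return str(sorted(hash_args.items())).encode("utf-8")


def checksum(hash_args: dict) -> str:
    # sha256 is an assumption; the script's real hash function is not shown here.
    return hashlib.sha256(hash_message(hash_args)).hexdigest()
```

Keeping the exact Python 2 byte sequence is what lets checksums computed before and after the migration match.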
[16:57:45] <elukey>	 stepping afk 10 mins :)
[16:58:31] <mforns>	 k
[17:10:02] <icinga-wm>	 RECOVERY - Check the last execution of refinery-drop-apiaction-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-apiaction-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:11:38] <icinga-wm>	 RECOVERY - Check the last execution of refinery-drop-eventlogging-partitions on an-coord1001 is OK: OK: Status of the systemd unit refinery-drop-eventlogging-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:21:01] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) help(bytes) in python2 leads to help(str), meanwhile on python3:  ` class bytes(object)  |  bytes(iterable_of_ints) -> bytes  |  bytes(string, encoding[, errors]) -> bytes  |  bytes(b...
[17:21:20] <elukey>	 mforns: going afk for the day, tomorrow if you have time let's check --^
[17:21:32] <elukey>	 (will read later)
[17:41:00] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10mforns) Thanks Luca for the pastes. I changed the syntax a bit to adapt to python3. And tested that everything is ok :]. Luckily, the checksums match the ones generated by python2, so we won'...
[17:41:38] <wikibugs>	 (03PS1) 10Mforns: Fix python3 incompatibility in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537705 (https://phabricator.wikimedia.org/T204735)
[17:43:34] <wikibugs>	 10Analytics, 10DBA, 10Data-Services: Prepare and check storage layer for hi.wikisource - https://phabricator.wikimedia.org/T219374 (10Urbanecm) a:03Marostegui Database was created.
[18:02:16] <leila>	 nuria: are you joining?
[18:02:34] <nuria>	 leila: ah, yes, sorry
[18:09:53] <joal>	 !log Restart eventlogging with new ua-parser (ottomata did)
[18:09:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:10:42] <joal>	 !log Kill-restart webrequest-load oozie job to pick-up new ua-parser
[18:10:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:16:22] <joal>	 !log Start mediawiki-history-dumps oozie job starting with August 2019
[18:16:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:19:33] <elukey>	 so far it seems no moar python failures right?
[18:19:59] <elukey>	 gooood
[18:40:29] <wikibugs>	 (03PS1) 10Joal: Update aqs to 3df76ab [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537721
[18:40:59] <joal>	 fdans: if you're still here, would you mind? --^
[18:42:30] <joal>	 if not, I'll do it myself (aqs latest commit-sha matches the one in the above aqs-deploy patch, which contains only node_modules changes in addition to the src one)
[18:44:31] <joal>	 ok - merging for deploy myself :)
[18:44:45] <joal>	 ottomata: are you nearby (in case, I'm deploying AQS soon)
[18:45:36] <ottomata>	 joal:  8 mins ok?
[18:45:43] <joal>	 np ottomata - Merging and prepping
[18:47:30] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537721 (owner: 10Joal)
[18:51:09] <joal>	 Ok - new user-agent is deployed :)
[18:51:52] <joal>	 Updating webrequest and pageview doc with high-level doc, will provide details tomorrow on dedicated page
[18:53:13] <ottomata>	 k here
[18:53:25] <joal>	 Ok ottomata - Everything is ready, deploying :)
[18:53:28] <ottomata>	 k
[18:53:33] <joal>	 !log Deploy AQS using scap
[18:53:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:55:07] <joal>	 Mwarf - failed
[18:56:52] <ottomata>	 oh ?
[18:56:53] <ottomata>	 fail how?
[18:57:03] <joal>	 problem of automated testing
[18:57:20] <joal>	 I had inserted the data, so I don't get it
[18:58:06] <joal>	 Ok I get it - Will correct this now
[18:58:36] <joal>	 I should have been more focused during CR
[18:59:29] <joal>	 !log Deploy AQS using scap - Try 2
[18:59:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:01:24] <joal>	 ok - second fail
[19:02:11] <ottomata>	 joal:  am here if you want me to help, let me know
[19:02:30] <joal>	 Thanks ottomata - not ops related I think
[19:03:44] <joal>	 Very much ...
[19:04:03] <joal>	 ok - Will provide a patch and redeploy
[19:05:25] <wikibugs>	 (03PS1) 10Joal: Fix mediarequest per referer sample request [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537728
[19:05:33] <joal>	 ottomata: --^ if you don't mind
[19:05:47] <joal>	 ottomata: maybe more details in commit message I guess
[19:10:07] <joal>	 ottomata: ping?
[19:10:13] <ottomata>	 joal:  ya
[19:10:15] <ottomata>	 sorry
[19:10:26] <joal>	 Thanks :)
[19:10:28] <ottomata>	 k
[19:10:32] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix mediarequest per referer sample request [analytics/aqs] - 10https://gerrit.wikimedia.org/r/537728 (owner: 10Joal)
[19:13:00] <wikibugs>	 (03PS1) 10Joal: Update aqs to 7a89363 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537731
[19:13:56] <ottomata>	 brb getting coffee
[19:14:07] <joal>	 sure ottomata 
[19:14:13] <joal>	 Will merge and prep for deploy
[19:14:33] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging fix for deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/537731 (owner: 10Joal)
[19:20:48] <wikibugs>	 10Analytics, 10Operations, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10herron) p:05Triage→03Normal
[19:23:22] <ottomata>	 back
[19:23:31] <joal>	 k ottomata - deploying :)
[19:23:41] <ottomata>	 k
[19:23:50] <joal>	 !log Deploy AQS using scap - Try 3
[19:23:52] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:24:27] <joal>	 here we go - successful canary
[19:27:28] <joal>	 ok deploy successful :)
[19:27:38] <ottomata>	 great!
[19:27:41] <joal>	 Thanks ottomata for the supervising :)
[19:27:47] <ottomata>	 i did a great job
[19:27:47] <ottomata>	 :p
[19:27:56] <joal>	 wikileader!
[19:28:23] <joal>	 ok - gone for tonight, deploy is finished :)
[19:28:34] <ottomata>	 ok, thanks joal, laterrrsss!
[19:28:36] <joal>	 more docs on ua-parser tomorrow (basics are here)
[20:36:35] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix python3 incompatibility in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/537705 (https://phabricator.wikimedia.org/T204735) (owner: 10Mforns)
[23:17:11] <wikibugs>	 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) 05Open→03Resolved [[ https://meta.wikimedia.org/wiki/Schema:AutoblockIpBlock | Schema:AutoblockIpBlock...
[23:17:13] <wikibugs>	 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10nettrom_WMF)
[23:20:17] <wikibugs>	 10Analytics, 10Community-Tech, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) @ifried or @aezell : not sure which one of you to contact, so I'm pinging you both, sorry! Would...
[23:24:16] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10Nuria) note to self, turnilo needs double quotes and no spaces:  {"49":"ACK+URG+FIN","48":"ACK+URG","40":"PSH+URG","36":"RST+URG","34":"SYN+URG","33":"FIN+...
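Python's `str()` of a dict uses single quotes and `", "` / `": "` separators, so it does not give the format this note asks for; `json.dumps` with compact separators does. A minimal sketch using the base map from earlier in the thread; `turnilo_map` is a hypothetical name, and the no-spaces requirement is taken from the note rather than from Turnilo documentation.

```python
import json

# Base flags map as posted earlier in the thread.
FLAG_NAMES = {"32": "URG", "16": "ACK", "8": "PSH", "4": "RST", "2": "SYN", "1": "FIN"}


def turnilo_map(mapping: dict) -> str:
    """Serialize with double quotes and no whitespace between items."""
    # separators=(",", ":") drops the spaces json.dumps adds by default;
    # key order follows the dict's insertion order.
    return json.dumps(mapping, separators=(",", ":"))
```

`turnilo_map(FLAG_NAMES)` produces `{"32":"URG","16":"ACK","8":"PSH","4":"RST","2":"SYN","1":"FIN"}`.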
[23:31:29] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10Nuria) {F30393170}
[23:34:32] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (10ayounsi) LGTM! Thanks!