[00:08:50] <isaacj>	 thanks ottomata !
[00:58:08] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (Ottomata) Ok, should be fixed.    I'm still installing the new anaconda-wmf on all the worker nodes, etc, but the stat boxes should be good to go.
[00:59:05] <ottomata>	 Urbanecm: no the server can keep running, we might set up something to close down active kernels, but right now it's ok.  If you start spark in yarn sessions, you should close those though.
[06:00:07] <elukey>	 good morning
[06:01:27] <elukey>	 wow so many alerts :D
[06:02:10] <elukey>	 ah one webrequest hour was killed
[06:03:11] <elukey>	 SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA: Failed to open new session: java.lang.RuntimeException: Couldn't create directory /tmp/e6961d81-7b18-48d7-bff3-9444b517411a_resources
[06:03:39] <elukey>	 mmmm
[06:03:43] <elukey>	 trying to re-run
[06:14:05] <elukey>	 all the other alarms are jobs waiting, so all dependent on webrequest_load failures IIUC
[06:18:21] <elukey>	 the two hours are refined now, jobs should be unblocked
[06:44:33] <elukey>	 !log re-deployed refinery to hadoop-test after fixing permissions on an-test-coord1001
[06:44:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:59:49] <joal>	 Thanks for the rerun elukey 
[07:02:10] <elukey>	 joal: bonjour :) started my ops week! 
[07:07:24] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/677803 is finally ready to go, to gzip the HDFS Namenode audit log
[07:07:41] <elukey>	 works in hadoop test, and it can be applicable to other big logs as well
[07:15:18] <elukey>	 joal: ok if we schedule the capacity scheduler deploy for Monday 19th?
[07:15:36] <elukey>	 I'll prep a wikipage about it and send an email to announce@ in case
[07:35:39] <joal>	 Yes monday 19th is good for the scheduler elukey :)
[07:35:43] <joal>	 thanks for checking :)
[07:36:10] <elukey>	 joal: perfect :)
[07:37:15] <elukey>	 joal: next week I'll also try to push for the hadoop masters + coord1001 to Buster, that should in theory complete the Buster rollout
[07:37:25] <joal>	 <3
[07:38:01] <elukey>	 there is a little reshape of the disk partitions as dependency, namely simplifying all in a /srv ext4 volume/partition
[07:38:15] <elukey>	 (like we do in hadoop test and an-coord1002)
[07:38:33] <elukey>	 so it will be easier in the future to reimage preserving /srv with the same recipe
[07:38:52] <elukey>	 in theory moving to Debian 11 should be easier with the new partman recipes + fixed uid/gid
[07:40:08] <joal>	 makes sense elukey - I didn't know we were having a different partition scheme for coord
[07:40:36] <elukey>	 we have
[07:40:36] <elukey>	 /dev/mapper/an--coord1001--vg-srv    102G   47G   55G  47% /srv
[07:40:36] <elukey>	 /dev/mapper/an--coord1001--vg-mysql   59G   49G   11G  83% /var/lib/mysql
[07:41:13] <elukey>	 so my idea is just to fold mysql into /srv, like we do for an-coord1002
[07:41:27] <elukey>	 (similar thing for the masters, different partition names but multiple ones)
[07:41:28] <joal>	 +1
[07:43:08] <elukey>	 I am going to restart the hdfs namenodes on the prod masters to apply the new log4j settings for the audit log
[07:43:14] <joal>	 ack
[07:43:22] <joal>	 checking if anything goes wrong
[07:44:14] <elukey>	 !log restart hadoop hdfs masters on an-master100[1,2] to apply the new log4j settings for the audit log
[07:44:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:44:38] <elukey>	 procedure is
[07:44:40] <elukey>	 failover to 1002
[07:44:43] <elukey>	 restart 1001
[07:44:44] <elukey>	 wait
[07:44:55] <elukey>	 failover to 100
[07:44:57] <elukey>	 *1001
[07:44:59] <elukey>	 restart 1002
[07:45:06] <elukey>	 (the first is done)
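The restart procedure listed above can be written down as an ordered plan. This is only a sketch in Python: `rolling_restart_plan` is a hypothetical helper that returns the steps as text, it does not run any Hadoop command, and it assumes the first host is the active NameNode when the procedure starts.

```python
def rolling_restart_plan(primary: str, secondary: str) -> list[str]:
    """Return the ordered steps for restarting two HA NameNodes one at a time.

    Assumes `primary` is active at the start. Each node is restarted only
    after the other one has taken over, so one NameNode is always serving.
    """
    return [
        f"failover to {secondary}",
        f"restart {primary}",
        "wait",
        f"failover to {primary}",
        f"restart {secondary}",
    ]


if __name__ == "__main__":
    for step in rolling_restart_plan("an-master1001", "an-master1002"):
        print(step)
```

The plan ends with the original active node active again, which matches the order given in the chat.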
[07:48:05] <elukey>	 1001 is starting
[07:51:37] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Upgrade furud/flerovium to Debian Buster - https://phabricator.wikimedia.org/T278421 (elukey) @razzi let's start working on this, let me know doubts/suggestions/etc.. :)
[08:06:55] <elukey>	 failover to 1001
[08:09:08] <joal>	 interesting elukey - I was testing gobblin while you were failing over - it managed itself correctly :)
[08:10:06] <elukey>	 joal: nice!
[08:10:23] <elukey>	 joal: ah one question from yesterday - is there a reason to keep a fork of Gobblin only for us?
[08:10:26] <joal>	 it was kinda expected, and indeed nice to see it in action :)
[08:10:51] <joal>	 elukey: interesting question
[08:10:56] <joal>	 elukey: two things
[08:11:29] <joal>	 elukey: we wish to use versions of hadoop/hive that differ from the ones used in the current gobblin version and involve some changes in code (mostly for tests)
[08:11:51] <joal>	 elukey: and we need gobblin-specific code for wmf that we currently store in a gobblin submodule
[08:11:58] <elukey>	 okok
[08:12:15] <joal>	 elukey: if you have ideas/suggestions on how to manage differently, I'm super happy to try them
[08:12:30] * joal should be better at understanding how to push upstream :S
[08:12:51] <elukey>	 nono it makes sense, I was just hoping that we could have used the upstream version
[08:13:25] <joal>	 elukey: I'll probably send a patch with my changes for hadoop versions (they are the ones compatible with bigtop)
[08:16:14] <joal>	 elukey: possibly we could push for gobblin to be present in bigtop, meaning version reconciliations
[08:20:23] <joal>	 also elukey, not sure if you noticed: as planned, work on mediawiki-text has started on the cluster :)
[08:29:35] <elukey>	 here I am sorry, coffee :)
[08:29:56] <elukey>	 is it a gentle way to say "stop doing weird changes to the cluster" ? :D
[08:30:18] <joal>	 elukey: absolutely no!!!
[08:30:35] <joal>	 elukey: it was more about the good thing of having a bigger cluster :)
[08:30:47] <joal>	 elukey: we have full root partition on stat1008 :(
[08:30:55] <elukey>	 ahhh okok :D :D
[08:31:00] <joal>	 elukey: sorry for the interrupt :S
[08:31:02] <elukey>	 yep I saw it, going to check in a sec
[08:31:07] <joal>	 <2
[08:31:23] <elukey>	 ah the root partition!
[08:31:29] <elukey>	 this is a new use case
[08:32:00] <joal>	 elukey: my work on gobblin uses it from what I can see (/tmp) - let me know if it's me causing the problem
[08:32:44] <elukey>	 joal: nono there are 24G in /var/spool/rsyslog, no idea why
[08:33:00] <joal>	 elukey: it could be me and my tests
[08:35:06] <elukey>	 !log apt-get clean on stat1008 to free some space
[08:35:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:37:24] <joal>	 elukey: should I wait for my jobs, or can I go?
[08:37:40] <elukey>	 joal: you can go now, I think it is conda/jupyterhub related
[08:38:09] <joal>	 Ah ok - thanks elukey :)
[08:49:30] <elukey>	 joal: I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/677816/
[08:49:33] <elukey>	 that should help
[08:50:46] <joal>	 awesome elukey
[09:19:10] <elukey>	 joal: I also didn't forget about the stat1008 that you observed yesterday, filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/677822
[09:20:41] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (elukey) The hdfs audit config has been deployed correctly on the prod master nodes, let's wait a couple of days to see how it goes. After that, we could m...
[09:30:38] <wikibugs>	 Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (Qgil) Thank you so much! This task can be resolved.  When the actual election timeline is shared, we are likely to file a task about a precise calculation of eligible...
[09:31:06] <elukey>	 I am checking https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop#Queues
[09:31:22] <elukey>	 and we have been suggesting to use the 'nice' queue in the past to users
[09:32:14] <joal>	 correct elukey, we have been suggesting this - but nobody does it anymore :)
[09:32:19] <elukey>	 is it ok to say that we'll replace both 'sequential' and 'nice' with fifo?
[09:32:47] <joal>	 elukey: sequential => fifo, 'nice' is dead
[09:33:25] <elukey>	 ok so basically remove also all the suggestions about nice, and use default instead
[09:33:54] <elukey>	 what is the use case for fifo/sequential? (just want to put the correct words on the wiki)
[09:37:38] <joal>	 elukey: the fifo use case was brought up by Erik, to run jobs sequentially, one after the other
[09:37:49] <joal>	 I don't have more about what the job does
[09:52:09] <elukey>	 joal: okok, so maybe Erik doesn't need it anymore
[09:52:57] <elukey>	 anyway, we'll keep it for the GPU use case
[09:53:00] <elukey>	 thanks :)
[09:53:19] <joal>	 you're welcome :)
[10:25:57] <elukey>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop#Queues
[10:34:02] <addshore>	 o/
[10:34:18] <addshore>	 Anyone got any ideas why the log files in /srv/analytics-wmde/graphite/log would be empty (on stat1007) ?
[10:34:42] <addshore>	 It looks like the last log rotated file was daily.03.log-20210328 and since then the current log files have nothing in them
[10:35:04] <elukey>	 addshore: o/ I think that Amir moved some crons to systemd timers
[10:35:16] <addshore>	 aaah, so logs moved? :D
[10:36:46] <elukey>	 addshore: so logs by default are only on journald, that keeps them on tmpfs (so basically on ram, they are wiped if we reboot) unless instructed otherwise in puppet (namely the systemd::timer config setting logging etc..)
[10:36:54] <elukey>	 for the moment journald cannot be accessed by regular users, it needs sudo
[10:37:05] <addshore>	 :/
[10:37:14] <elukey>	 I just tried sudo journalctl -u wmde-analytics-minutely and the first log is march 28th at 21:31
[10:37:34] <elukey>	 addshore: did you see https://phabricator.wikimedia.org/T278665?
[10:37:35] <addshore>	 ack, okay
[10:37:50] <addshore>	 yeah, that was disabled as is fine
[10:37:54] <addshore>	 this is about the other scripts
[10:37:58] <elukey>	 yes yes :)
[10:38:15] <addshore>	 could you grep the daily logs for `wikidata-site_stats-active_users_by_namespace` and pastebin the output somewhere ?
[10:38:41] <elukey>	 daily early or noon?
[10:38:51] <addshore>	 03
[10:39:23] <elukey>	 ??? :D
[10:39:43] <elukey>	 I mean wmde-analytics-daily-early.service vs wmde-analytics-daily-noon.service
[10:39:46] <addshore>	 early!
[10:39:49] <elukey>	 ah!
[10:39:56] <addshore>	 sorry, i didnt realize that the names changed too :D
[10:40:34] <elukey>	 so the only thing that I find are lines like
[10:40:38] <elukey>	 Apr 08 03:00:00 stat1007 time[7026]: 2021-04-08 03:00:00 wikidata-site_stats-active_users_by_namespace Script Started!
[10:40:47] <addshore>	 just a single line?
[10:41:00] <elukey>	 yes, but it may have more output later on
[10:41:39] <addshore>	 so yesterday there is only 1 output line too?
[10:41:48] <addshore>	 sounds like it is taking too long to run and being killed by something
[10:43:07] <elukey>	 addshore: you should have a file named wmde-analytics-daily-early.service.log in your home dir on stat1007
[10:43:10] <elukey>	 it contains all the logs
[10:43:12] <addshore>	 ty!
[10:43:44] <elukey>	 going afk for lunch, will check later :)
[10:44:31] <addshore>	 I think I found the issue `PHP Fatal error:  Uncaught TypeError: Argument 1 passed to WikidataActiveUsersByNamespace::collectNamespaces() must be of the type array, object given, called in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/site_stats/active_users_by_namespace.php on line 26 and defined in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/site_stats/active_users_by_namespace.php:55`
[11:02:05] <elukey>	 sigh lunch break delayed, I have to restart masters again, the log4j config was not super correct
[11:02:08] <elukey>	 1G vs 1GB
[11:14:05] <hnowlan>	 !log created aqs user and loaded full schemas into analytics wmcs cassandra
[11:14:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:50] <joal>	 :/
[11:31:24] <elukey>	 yeah not even with 1GB is fine, I thought it was
[11:31:35] <joal>	 wut?
[11:31:42] <joal>	 what is it expected then?
[11:31:46] <joal>	 moar memoary?
[11:33:40] <elukey>	 nono I think it needs bytes
[11:33:45] <elukey>	 so 1000000000
[11:33:51] <joal>	 pfff :(
[11:33:55] <elukey>	 https://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/SizeBasedTriggeringPolicy.html#setMaxFileSize(long)
[11:34:18] <joal>	 in the era of 5g, who still talks in bytes?
[11:35:16] <wikibugs>	 Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (JAllemandou) Ping @kzimmerman on the above comment - Let's synchronize on who does what :)
[11:35:27] <wikibugs>	 Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (JAllemandou) Open→Resolved a:JAllemandou
[11:35:55] <elukey>	 I tested it with bytes on hadoop test, and naively thought that 1GB would have been ok
[11:36:02] <elukey>	 amusingly it doesn't fail
[11:36:15] <elukey>	 it just rotates files after few mbs
[11:36:21] <joal>	 this -^ is scary
[11:37:26] <elukey>	 I am failing over/restarting again, hopefully this time is the last one
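The unit confusion above (`1G` vs `1GB` vs a plain byte count) can be avoided by converting a human-readable size to an integer before it goes into the config. A minimal Python sketch, assuming decimal units as in the `1000000000` value quoted above; `size_to_bytes` is a hypothetical helper and not part of log4j or puppet:

```python
import re

# Decimal (SI) multipliers, matching the 1000000000 figure quoted in the chat.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_to_bytes(size: str) -> int:
    """Convert a string such as '1GB' or '512 MB' to a plain byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?B)\s*", size.upper())
    if match is None:
        # Ambiguous spellings such as '1G' are rejected rather than guessed.
        raise ValueError(f"unrecognised size: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    print(size_to_bytes("1GB"))
```

Raising on an unrecognised spelling is a deliberate choice here: the chat notes that the misconfigured value did not fail but silently rotated files after a few MB.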
[11:46:18] <hnowlan>	 AQS appears to be running okay but it's not listening on the configured port - nothing in the logs that I can see that indicates something being broken. I feel like i've hit this before in other service-runner services but I can't remember what the problem was. Any ideas? 
[11:47:38] <elukey>	 hnowlan: there is a setting in the yaml config to send logs to logstash IIRC, if you remove it and restart it should emit everything in the logs
[11:48:20] <hnowlan>	 yeah, I have it logging to stdout atm but there's no indication of failures that I can see
[11:51:10] <elukey>	 mmm
[11:51:47] <elukey>	 hnowlan: log level? Can it be changed?
[12:32:57] <wikibugs>	 (PS1) Silvan Heintze: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999)
[12:37:17] <wikibugs>	 (CR) Silvan Heintze: "I hope this fixes the error that prevents the script from running in production. No tests here, unfortunately." [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[12:54:58] <hnowlan>	 elukey: I have it on debug already :( I think it's some quirk on the config, I'll keep digging 
[13:15:52] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (Ottomata) Ok, all should be back to normal!
[13:16:03] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (Ottomata) p:Triage→High
[13:31:31] <wikibugs>	 Analytics, Analytics-Kanban, Datasets-General-or-Unknown, Patch-For-Review: [Legal] Downloads license should mention CC0 for Analytics datasets - https://phabricator.wikimedia.org/T278409 (Ottomata) Ok, done! https://dumps.wikimedia.org/legal.html
[13:36:35] <klausman>	 everyone, I'm gonna take the rest of the afternoon off, the weather here is *way* too nice to be sitting inside (gotta get that vitamin D!)
[13:37:13] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Readers-Web-Backlog: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (Ottomata)
[13:44:19] <elukey>	 klausman: ack!
[13:44:42] <elukey>	 joal, ottomata - 1GB of gzipped hdfs-audit log is... 27MB :D
[13:45:26] <elukey>	 just tried to copy and decompress, it gets back to ~950MB
[13:45:55] <elukey>	 and it makes sense, the entries are repeating a lot so text compression works really nicely
[13:46:37] <elukey>	 for the moment we'll keep 10x1GB worth of logs (ending up in ~27MBx10 on disk), but we can expand the max file size to say 5GB easily
[13:46:45] <elukey>	 keeping a lot more of history
[13:46:51] <elukey>	 super happy about it
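The numbers above work out as follows. This is back-of-the-envelope arithmetic in Python using only the figures quoted in the chat; the 5GB projection assumes the same compression ratio would hold for larger files, which has not been measured.

```python
UNCOMPRESSED_MB = 1000  # one rotated audit log, taking 1GB as 1000MB
COMPRESSED_MB = 27      # size of the same file after gzip, as reported above
KEPT_FILES = 10         # number of rotated files retained

ratio = UNCOMPRESSED_MB / COMPRESSED_MB
disk_now_mb = KEPT_FILES * COMPRESSED_MB

# Projection for 5GB files, assuming the same ratio holds (not measured).
projected_file_mb = 5 * UNCOMPRESSED_MB / ratio
projected_disk_mb = KEPT_FILES * projected_file_mb

print(f"compression ratio: about {ratio:.0f}x")
print(f"disk used now: {disk_now_mb} MB")
print(f"projected disk with 5GB files: about {projected_disk_mb:.0f} MB")
```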
[13:47:49] <klausman>	 elukey: note that compress-once-uncompress-often use cases hugely benefit from xz/lzma or zstd
[13:48:23] <klausman>	 gzip/lempel-Ziv is >50 years old by now :)
[13:48:45] <elukey>	 klausman: no idea if anything different from gz is supported by the log4j appender, I suspect not..
[13:49:10] <elukey>	 I am currently happy about this result :D
[13:49:24] <klausman>	 Sure :)
[13:49:27] <elukey>	 (the log4j config to achieve it is a bit mental)
[13:52:23] <klausman>	 I *think* log4j might support .xz, but probably not worth the bother.
[13:55:19] <ottomata>	 +1 elukey  nice!
[14:07:22] <elukey>	 !log drop /var/spool/rsyslog from stat1008 - corrupted files due to root partition filled up caused a SEGV for rsyslog
[14:07:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:09:55] <wikibugs>	 (CR) Hnowlan: [C: +2] Add makefile and dockerfile for local tests [analytics/aqs] - https://gerrit.wikimedia.org/r/676402 (owner: Hnowlan)
[14:11:20] <wikibugs>	 (Merged) jenkins-bot: Add makefile and dockerfile for local tests [analytics/aqs] - https://gerrit.wikimedia.org/r/676402 (owner: Hnowlan)
[14:22:44] <wikibugs>	 (CR) Tonina Zhelyazkova: [C: +1] Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[14:23:31] * elukey bbiab! getting some oxygen from outside :D
[14:27:52] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Fix PHP Fatal error (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[14:31:28] <mforns>	 elukey: thanks for looking into yesterday's deployment error, and also the webrequest alerts
[14:41:44] <wikibugs>	 (PS1) Silvan Heintze: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999)
[14:43:06] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[14:44:18] <wikibugs>	 (Merged) jenkins-bot: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677871 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[15:01:07] <ottomata>	 dan, joal  yt?
[15:01:21] <ottomata>	 milimetric: ^ ?
[15:01:34] <milimetric>	 yea hey ottomata 
[15:02:30] <ottomata>	 https://docs.google.com/document/d/1vcSX2FUCGG52VHJDXAELziyMNCaEjWGdv_k-8Jp3IGg/edit#
[15:02:43] <ottomata>	 this is not meant yet for public consumption, just mostly writing that up for desiree
[15:03:09] <ottomata>	 lemme know what you think / if it makes any sense
[15:03:16] <ottomata>	 just wanted someone to look over it before I send it to her
[15:10:13] <milimetric>	 ottomata: it looks good to me.  Is this reflected properly in the annual plan?
[15:10:28] <ottomata>	 i don't think so
[15:10:35] <ottomata>	 but maybe we can get it there?
[15:10:41] <ottomata>	 i've been talking with tajh etc. about it
[15:10:46] <ottomata>	 but i dunno
[15:12:18] <mforns>	 fdans: yesterday I didn't start the pageview monthly dumps, because it was rather late when I finished the deployment, and I lost your etherpad notes... when did you want me to start that job for?
[15:13:35] <wikibugs>	 (Abandoned) Silvan Heintze: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[15:22:03] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (elukey) ` elukey@an-worker1100:~$ sudo megacli -AdpBbuCmd -BbuLearn -aAll                                       Adapter 0: BBU Learn Failed  Exit Code: 0x01 `  This is also weird..
[15:23:17] <wikibugs>	 Analytics, WMCZ-Stats: Review request: New datasets for WMCZ published under analytics.wikimedia.org - https://phabricator.wikimedia.org/T279567 (Urbanecm)
[15:23:24] <wikibugs>	 (PS1) Silvan Heintze: Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677935 (https://phabricator.wikimedia.org/T275999)
[15:24:18] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677935 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[15:28:37] <wikibugs>	 (Merged) jenkins-bot: Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677935 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[15:32:14] <elukey>	 joal: qq - is it ok if I reboot an-worker1100?
[15:33:42] <elukey>	 I don't see the mediawiki-text job running but I might not look in the right direction
[15:35:54] <elukey>	 !log reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475
[15:35:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:35:58] <stashbot>	 T279475: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475
[15:38:34] <fdans>	 mforns: don't worry, I'll start it :)
[15:39:01] <mforns>	 fdans: I have the command all prepared and ready to fire, just need the timestamp :]
[15:39:44] <fdans>	 mforns: I didn't include it in the command? it should be April
[15:40:22] <fdans>	 like 2021-04-01T00:00
[15:41:19] <razzi>	 Hiya team
[15:43:22] <mforns>	 fdans: ok
[15:43:23] <razzi>	 !log rebalance kafka partitions for webrequest_text partitions 17, 18
[15:43:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:45:18] <mforns>	 hey razzi :]
[16:00:19] <mforns>	 hey a-team, will be 2 mins late to standup
[16:00:37] <fdans>	 mforns: cool!
[16:05:01] <wikibugs>	 Analytics, Analytics-Kanban, Research: Webrequest.isWMFDomain should return true for .wmflabs.org domains. - https://phabricator.wikimedia.org/T277536 (Ottomata) a:JAllemandou
[16:05:27] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Upgrade the Cassandra AQS cluster to Cassandra 3.11 - https://phabricator.wikimedia.org/T255141 (hnowlan)
[16:06:02] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Upgrade the Cassandra AQS cluster to Cassandra 3.11 - https://phabricator.wikimedia.org/T255141 (hnowlan) a:hnowlan
[16:07:57] <wikibugs>	 Analytics, Analytics-Kanban, Research: Webrequest.isWMFDomain should return true for .wmflabs.org domains. - https://phabricator.wikimedia.org/T277536 (Ottomata) p:Triage→Medium
[16:14:34] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (elukey) The alert recovered, but I discovered a bad disk that needs to be replaced (had to clear preserved cache to allow boot, and one partition didn't mount). Hopefully we'll get...
[16:23:39] <wikibugs>	 Analytics, Analytics-Kanban: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (fdans) Open→Resolved
[16:23:47] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (fdans) Open→Resolved
[16:23:49] <wikibugs>	 Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (fdans)
[16:23:57] <wikibugs>	 Analytics, Analytics-Kanban, Datasets-General-or-Unknown: [Legal] Downloads license should mention CC0 for Analytics datasets - https://phabricator.wikimedia.org/T278409 (fdans) Open→Resolved
[16:24:00] <wikibugs>	 Analytics, Analytics-Kanban, Better Use Of Data: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (fdans) Open→Resolved
[16:24:07] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (fdans) Open→Resolved
[16:24:09] <wikibugs>	 Analytics, Analytics-Kanban: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (fdans)
[16:24:17] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Jupyter conda bug: cannot spawn new server - https://phabricator.wikimedia.org/T279480 (fdans) Open→Resolved
[16:24:22] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (fdans)
[16:24:34] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: /wmf/data/raw should be readable by analytics-privatedata-users - https://phabricator.wikimedia.org/T275396 (fdans) Open→Resolved
[16:24:39] <wikibugs>	 Analytics-Kanban, Patch-For-Review: Update sqoop to work with multi-instance clouddb1021 mariadb host - https://phabricator.wikimedia.org/T274690 (fdans) Open→Resolved
[16:24:46] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (fdans)
[16:24:52] <wikibugs>	 Analytics, Analytics-Kanban: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (fdans) Open→Resolved
[16:24:56] <wikibugs>	 Analytics, Analytics-Kanban: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (fdans) Open→Resolved
[16:24:58] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (fdans) Open→Resolved
[16:25:05] <wikibugs>	 Analytics, Analytics-Kanban, Research: Webrequest.isWMFDomain should return true for .wmflabs.org domains. - https://phabricator.wikimedia.org/T277536 (fdans) Open→Resolved
[16:25:07] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (fdans) Open→Resolved
[16:25:12] <wikibugs>	 Analytics-Kanban, Better Use Of Data, Product-Analytics, Tracking-Neverending: Superset Updates - https://phabricator.wikimedia.org/T211706 (fdans)
[16:25:14] <wikibugs>	 Analytics, Event-Platform, Research: TranslationRecommendation* Schemas Event Platform Migration - https://phabricator.wikimedia.org/T271163 (fdans)
[16:25:16] <wikibugs>	 Analytics, Analytics-Kanban: Create Spark code to compare DateTimes with partition columns - https://phabricator.wikimedia.org/T212451 (fdans) Open→Resolved
[16:26:10] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (elukey) One drive is in a Foreign state, no idea why (also unconfigured - good):  ` Enclosure Device ID: 32 Slot Number: 10 Enclosure position: 1 Device Id: 10 WWN: 5000c500cf8ee990...
[16:29:29] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) Resolved→Open >>! In T269211#6974963, @elukey wrote: > Almost! There are a couple of things left: >  > - clou...
[16:33:42] <elukey>	 !log reboot an-worker1100 again to check if all the disks come up correctly
[16:33:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:34:59] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (elukey) I had to do: ` megacli -CfgForeign -Scan -a0 megacli -CfgForeign -Clear -a0 megacli -CfgLdAdd -r0 [32:10] -a0 `  And the disk came back to life and I was able to re-mount it...
[16:38:29] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (razzi) @elukey Good point. With regards to the memory warning, we can:  - lower the memory allocated to the sections, but thi...
[16:40:05] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, DBA, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) We have the following in hiera:  ` # clouddb1021 profile::base::notifications: disabled [..] ` That needs to be remov...
[16:41:30] <wikibugs>	 (PS1) Silvan Heintze: Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677975 (https://phabricator.wikimedia.org/T275999)
[16:42:07] <fdans>	 razzi yo!
[16:42:19] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Icinga/MegaRAID alert on an-worker1100 - https://phabricator.wikimedia.org/T279475 (elukey) Open→Resolved a:elukey All good, I'll re-open in case something weird comes up, but now all disks are good :)
[16:46:15] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677975 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[16:47:17] <wikibugs>	 (Merged) jenkins-bot: Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677975 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[16:56:42] <wikibugs>	 (PS1) Addshore: Remove unneeded alias [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677949
[16:56:47] <wikibugs>	 (CR) Addshore: [C: +2] Remove unneeded alias [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677949 (owner: Addshore)
[16:57:55] <wikibugs>	 (Merged) jenkins-bot: Remove unneeded alias [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677949 (owner: Addshore)
[16:58:25] <wikibugs>	 (Restored) Addshore: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[16:58:31] <wikibugs>	 (PS2) Addshore: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[16:58:51] <wikibugs>	 (CR) Addshore: [C: +2] Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[17:00:51] <wikibugs>	 (PS1) Addshore: Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677950 (https://phabricator.wikimedia.org/T275999)
[17:01:05] <wikibugs>	 (PS1) Addshore: Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677951 (https://phabricator.wikimedia.org/T275999)
[17:01:10] <wikibugs>	 (CR) Addshore: [C: +2] Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677950 (https://phabricator.wikimedia.org/T275999) (owner: Addshore)
[17:01:13] <wikibugs>	 (CR) Addshore: [C: +2] Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677951 (https://phabricator.wikimedia.org/T275999) (owner: Addshore)
[17:02:04] <wikibugs>	 (Merged) jenkins-bot: Fix PHP Fatal error [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677722 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[17:03:30] <wikibugs>	 (Merged) jenkins-bot: Fix SQL query field name [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677950 (https://phabricator.wikimedia.org/T275999) (owner: Addshore)
[17:03:33] <wikibugs>	 (Merged) jenkins-bot: Final fixes to get editors split by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/677951 (https://phabricator.wikimedia.org/T275999) (owner: Addshore)
[17:46:41] <elukey>	 razzi: ok I am ready
[17:46:45] <elukey>	 is the bc busy?
[17:46:51] <razzi>	 elukey: yeah, let's tardis
[17:47:16] <razzi>	 https://meet.google.com/kti-iybt-ekv
[17:47:16] <elukey>	 razzi: I lost the quick link, can you paste it?
[17:47:20] <elukey>	 thanks :)
[17:49:31] <hnowlan>	 I'm stumped by this aqs not-listening thing. petr wasn't sure what it might be either, might need to get some help with it on tuesday because I am not good at js 
[17:53:00] <hnowlan>	 the yolops in me says "it can connect to cassandra 2.x fine, which is where the real change was", could try carefully deploying to a single prod server and see what happens
[17:53:14] <hnowlan>	 rather than try to fix the analytics cluster itself 
[17:53:22] <hnowlan>	 er wmcs cluster that is
[18:15:34] <elukey>	 hnowlan: we could add the new aqs nodes to a new scap env or just simply to the scap config, and then scap deploy --limit new-hostname
[18:16:47] <elukey>	 what node are you working on in wmcs?
[18:17:33] <elukey>	 it would be nice to fix it so we get a test cluster
[18:23:25] <joal>	 milimetric: do you have a minute for gobblin sync?
[18:23:38] <elukey>	 hnowlan: I removed the cee logging, set log level to trace and I see
[18:23:41] <elukey>	 "err":{"message":"All host(s) tried for query failed. First host tried, 172.16.3.23:9042: AuthenticationError: Authentication provider not set. See innerErrors.","name":"NoHostAvailableE
[18:23:45] <elukey>	 me":"AuthenticationError","info":"Represents an authentication error from the driver or from a Cassandra node.","message":"Authentication provider not set"}
[18:27:46] <elukey>	 the 'password' field in the config.yaml is empty, not sure what it is 
[18:27:53] <elukey>	 tried cqlsh but it seems broken
[18:29:38] <elukey>	 anyway, let's check this on tue :)
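The error pasted above ("Authentication provider not set") together with the empty 'password' field suggests the client was started without usable credentials. A minimal sketch of a pre-flight check, assuming a hypothetical config dict with `username` and `password` keys (the key names, the `aqs` username, and the helper `missing_credentials` are illustrative, not taken from the real config.yaml):

```python
def missing_credentials(config):
    """Return the names of credential fields that are absent or blank."""
    missing = []
    for key in ("username", "password"):  # hypothetical key names
        value = config.get(key)
        if value is None or str(value).strip() == "":
            missing.append(key)
    return missing


# Illustrative config: the host comes from the log above, the rest is made up.
config = {"hosts": ["172.16.3.23"], "username": "aqs", "password": ""}

problems = missing_credentials(config)
if problems:
    print("refusing to connect, blank credential fields:", ", ".join(problems))
```

Failing early like this would point at the config rather than at the driver, though it does not confirm that a blank password is the actual root cause here.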
[18:29:44] <elukey>	 have a good (long) weekend folks!
[18:29:50] <joal>	 bye elukey :)
[18:31:08] <wikibugs>	 Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (kzimmerman) @JAllemandou @qgil I think it makes sense for this kind of task to be handled by Product Analytics in the future.  @Niharika & @DannyH - on the product si...
[18:42:16] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi)
[18:45:36] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) Alright, nagios doesn't support it directly, but we can use a shell script, like [modules/profile/files/eventstreams/check_eventstreams.sh](https://github.com/wik...
[18:58:58] <joal>	 Gone for tonight - Enjoy the long weekend folks :)
[20:31:20] * razzi lunchtime!
[20:42:45] <isaacj>	 milimetric: https://turnilo-public.wmcloud.org/ (just running their example data right now)
[20:45:30] <isaacj>	 looks like both of those example datasets are just large JSON files so i'll see how our data looks in that format (with fake values). thanks for thinking of this!
[20:49:50] <nuria>	 turnilo-public, nice!
[20:56:19] <isaacj>	 nuria: yep, just testing out as a possibility. context is that T270140 is a dataset with too many facets to really play nicely with dashiki or other approaches
[20:56:20] <stashbot>	 T270140: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140
[20:59:53] <nuria>	 isaacj: is data stored in the druid backend that powers the wikistats api?
[21:08:44] <isaacj>	 nuria: no, right now the idea is local flatfiles until we figure out a more robust solution
[21:26:34] <milimetric>	 yeah, no worries, we didn't get too fancy since you left :)
[21:28:41] <milimetric>	 nice isaacj, yeah, it's probably going to work out.  I'd be interested how they did the dimension/metric grouping in their COVID example, I wasn't aware it could do that!
[21:48:13] <isaacj>	 the covid data is a pretty hefty config file :)
[22:22:13] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) Seeing this error when running puppet on an-tool1010:  ` Error: /Stage[main]/Profile::Superset/File[/usr/local/bin/check_superset_http]: Could not evaluate: Could...
[22:42:43] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (razzi)
[23:03:39] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) Fixed the puppet:/// resource error, now getting curl code 47, too many redirects
[23:06:20] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) Ok, interesting, the error is happening only on an-tool1010, on an-tool1005 it works fine. I'll roll the check back for now while I look into a command that works...
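The T277729 updates describe wrapping an HTTP probe in a script so Nagios/Icinga can run it. A rough sketch of that idea, written in Python rather than shell and not the actual check_superset_http script: it maps a probe result onto the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names `classify`, `check_http` and `main` are assumptions made for illustration, and treating a redirect failure as WARNING is an arbitrary choice.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, error=None):
    """Map a probe result to a (exit_code, message) pair."""
    if error is not None:
        return CRITICAL, f"CRITICAL - request failed: {error}"
    if 200 <= status < 300:
        return OK, f"OK - HTTP {status}"
    if 300 <= status < 400:
        return WARNING, f"WARNING - unexpected redirect, HTTP {status}"
    return CRITICAL, f"CRITICAL - HTTP {status}"


def check_http(url, timeout=10):
    """Probe url once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib follows redirects itself; a redirect loop or an overly
        # long redirect chain ends up here with the 3xx status code.
        return classify(exc.code)
    except (urllib.error.URLError, OSError) as exc:
        return classify(None, error=exc)


def main(argv):
    """Print the status line and return the plugin exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_http.py URL")
        return UNKNOWN
    code, message = check_http(argv[1])
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))`. A redirect loop surfaces here as a 3xx `HTTPError`, the same condition curl reports with exit code 47, so the check output would show which host redirects instead of failing with a bare exit code.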