[00:11:25] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:34:41] Analytics, Analytics-Cluster, Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (Groceryheist) Hey @Ottomata, it turns out that I think stat1006 is a better fit for my purposes since it has ORES dependencies (mainly hunspell) that were missing on the notebook machin...
[06:39:27] Analytics: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (elukey) I can see three measures now in turnilo! \o/ Nuria how did you delete the segments? With https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Actual_deletion ?
[06:39:29] morning!
[06:39:38] ---^ turnilo now shows the measures!
[06:43:48] Analytics: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (elukey) p:Triage→Normal
[06:45:56] Analytics, Research: Check home leftovers of ISI researchers - https://phabricator.wikimedia.org/T215775 (elukey) Ping @leila @Isaac :)
[06:56:54] Morning elukey :)
[06:57:03] Moar measurez !
[06:57:10] \o/
[07:28:27] ah!
[07:28:29] "Exception: Invalid security checksum passed with --execute."
[07:28:40] Interesting!
[07:28:46] Marcel has been defeated by his own code :D
[07:29:07] rightfully, since we added the --skip-trash!
[07:30:22] but!
[07:30:23] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/533955/2/modules/profile/manifests/analytics/refinery/job/data_purge.pp
[07:30:31] seems that the sha were updated
[07:30:32] mmmmm
[07:31:04] elukey: code always defeats humans - Rightfully shall I say :)
[07:32:23] joal: it is also true that another human would need a lot of strength to defeat marcel in a battle :D
[07:32:50] more than fair - This means I'll definitely not take over Marcel's code :)
[07:34:06] ahahahah
[07:37:11] elukey: I'm investigating T232382 - Very weird stuff
[07:37:14] T232382: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382
[07:38:38] :(
[07:41:43] ahhhh wait
[07:41:56] I think I might know why the drop is failing
[07:42:13] the command is passed directly to the ExecStart of systemd
[07:42:32] with a big string of chars that might be interpreted (not as regex)
[07:42:49] so the script gets a different set of arguments
[07:42:55] and rightfully alerts us
[07:44:02] but it is a daily one
[07:44:19] so in theory it should have worked before (when the --skip-trash wasn't there)
[07:44:22] mmmm
[07:47:02] trying to run it on an-coord1001 without the --execute to get the sha
[07:57:41] hm - weird
[07:57:52] \o/ ! I have found the issue with superset
[07:57:59] Man - This thing is not intuitive
[07:59:49] what is the issue?
[07:59:53] (will read the task :)
[08:00:23] elukey: multiple ways to get to the same data, with (small but not negligible) differences among them
[08:00:44] namely: using floatSum instead of doubleSum (or longSum) in druid
[08:01:07] ahhhh
[08:10:44] Analytics, Product-Analytics: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382 (JAllemandou) Thanks for reporting @kzimmerman :) The reason for the discrepancy is the aggregation function of the generated druid query when using superset metric-aggregation. When choosin...
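The floatSum vs doubleSum discrepancy mentioned above can be illustrated with a minimal sketch. This is not Druid code: it only assumes that a float-typed sum accumulates in 32-bit precision while a double-typed sum accumulates in 64-bit precision, and the counts are made up for illustration. The helper names `to_float32`, `float_sum` and `double_sum` are hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float_sum(values):
    """Accumulate in 32-bit precision (illustrative stand-in for a float-typed sum)."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def double_sum(values):
    """Accumulate in 64-bit precision (illustrative stand-in for a double-typed sum)."""
    total = 0.0
    for v in values:
        total += float(v)
    return total


# Made-up counts: one large value followed by small increments.
# Above 2**24 a 32-bit float can no longer represent every integer,
# so the small increments are lost in the 32-bit accumulator.
counts = [16_777_216, 1, 1, 1]

print(float_sum(counts))
print(double_sum(counts))
```

With large pageview totals the two accumulators drift apart in the same way, which would be consistent with the small but visible differences described in the task.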
[08:11:05] Analytics, Analytics-Kanban, Product-Analytics: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382 (JAllemandou) a:JAllemandou
[08:11:41] ok - one less task
[08:13:26] mmm so the sha of the data drop seems ok (15b56cab8d8920a73c0aad0085c6dd36)
[08:13:54] elukey: your sha error makes me feel less alone in problems not making sense
[08:14:03] (sorry)
[08:16:36] joal: I am starting to think that we have a sort of curse, namely when we work together weird things pop up
[08:16:39] :D
[08:17:38] elukey: Shall I take nickname ScoobyDoo ?
[08:18:10] ahahahahah
[08:48:46] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (elukey) @Cmjohnson should the following descriptions be updated as well with their `an-presto` equivalents?...
[08:53:54] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (MoritzMuehlenhoff) The old host definitions for cloudviran are still in debmonitor, puppetdb and site.pp are...
[09:06:55] elukey: could we try to dedicate some time to testing the notebooks with kerberos? I think we need Andrew's help on how notebooks are setup with spark
[09:07:08] elukey: I think this is the last thing not yet tested
[09:07:38] joal: sure, what is missing to test them?
[09:07:46] I am very ignorant as well but I can take a look
[09:08:02] elukey: the configuration to have notebooks-kernel setup to talk to spark
[09:08:51] When using swap, there is a list of kernels showing up allowing to have spark instances - These are the ones I'd like to test as they are the ones mostly used by our swap users
[09:09:17] ah ok, and those I suppose are files stored somewhere right?
[09:09:28] the kernels I mean
[09:09:38] if we find where they are stored, it should be easy enough to copy
[09:10:17] I think so yes - They must be part of the files defined in the default jupyter-env
[09:13:06] Analytics, Analytics-Kanban: mediawiki-history-wikitext-coord job fails every month - https://phabricator.wikimedia.org/T228883 (JAllemandou) Update with investigation results so far: August job failed for dewiki only, with a decompression error: ` java.lang.ArrayIndexOutOfBoundsException: 18002 at org...
[09:14:29] elukey: if you don't mind, can you proofread --^ and confirm it's understandable?
[09:16:46] elukey: also, archiva-mirrored still not working :(
[09:25:54] sorry got distracted, reading
[09:26:00] for archiva, will triple check again
[09:26:52] makes sense to me
[09:34:07] back ... Uploading jars is painful :(
[10:35:09] Analytics, Analytics-Kanban, Analytics-Wikistats, Operations, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (ema) Open→Resolved >>! In T230772#5464981, @Gilles wrote: > This should probably be its own task, though, it's not specific to piwik.js Agreed, I...
[10:35:18] Analytics, Operations, Traffic: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (ema)
[11:05:37] FYI: I'm bumping the MobileWebUIActionsTracking sampling rate from 1% to 10%. Usually we have ~20 events per minute. New expected rate is ~200. https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=MobileWebUIActionsTracking&from=now-7d&to=now
[11:08:05] thanks for letting us know :)
[11:08:48] going afk for lunch!
[12:20:22] Analytics: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (Nuria) I updated docs, i used the old coordinator UI to delete segments: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Delete_segments
[12:21:43] Analytics: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (Nuria) Now, the new dimensions did not appear immediately
[12:25:49] Analytics, Analytics-Kanban: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (Nuria)
[12:54:05] did anything change on archiva? I tried to build prometheus-jmx-exporter on Buster, but it fails with http://paste.debian.net/1099846/
[12:54:14] in fact I can't access https://archiva.wikimedia.org/repository/mirrored/org/sonatype/oss/oss-parent/7/oss-parent-7.pom (times out for me)
[12:54:43] Hi moritzm - It's a known issue - see https://phabricator.wikimedia.org/T232456
[12:54:59] ah, thanks. Missed that task
[12:55:45] moritzm, joal still seems pretty important no?
[12:56:12] nuria: very important indeed - broken since yesterday, but it's complicated
[12:56:18] joal: jajajaj
[12:56:31] joal: seems an arzel type of problem?
[12:56:40] yes m'am
[12:56:52] joal: ayayay, ok, let's talk to luca later
[13:09:18] nuria: about the systemd failure, this morning I tried to debug it re-running the script without --execute
[13:09:23] but the sha seems correct
[13:09:29] I forgot to reply to the email
[13:09:47] elukey: ok, i will try
[13:10:13] I am wondering if systemd interpolates some of the variables in the command, as I always say it is not like running it in bash
[13:10:27] systemd does its own interpretation of the execstart command
[13:10:49] elukey: ok, will try to understand
[13:10:59] elukey: how about archiva?
[13:11:07] elukey: did you talk to arzel about it?
[13:11:26] nuria: he is not online yet, I opened a task with all the details
[13:11:33] elukey: k
[13:11:38] https://phabricator.wikimedia.org/T232456
[13:14:21] interestingly, the spark maven repo seems to work
[13:14:27] but not cloudera and central
[13:21:29] elukey: smells like issues derived from sat changes right?
[13:23:13] I had the same thought, but the weird thing is that e.g. a "openssl s_client -connect repo.maven.apache.org:443" fails from eqiad, but works in esams
[13:23:22] Analytics, Analytics-Kanban: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (Nuria) @ayounsi please take a look and let me know, data sizes on segments vary a lot from 15M some days to 400M in others, does that sound like it makes sense?
[13:23:26] and the latter has the same mitigations actively
[13:23:41] but the timing is certainly a massive coincidence
[13:23:53] maybe the setup is different between eqiad/esams
[13:25:51] moritzm: yeah there could be something similar, even if it is strange that it fails during TLS handshake
[13:25:58] (so TCP conns are working fine)
[13:27:01] yeah, this needs some netops magic
[13:27:31] moritzm: and did we by any chance block ports in eqiad the other day
[13:28:39] Analytics, Analytics-Kanban: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (elukey) >>! In T232226#5478915, @Nuria wrote: > @ayounsi please take a look aand let me know, data sizes on segments vary a lot from 15M some days to 400M in others, do...
[13:29:38] nuria: they were asking for a whitelist of UDP, but I don't really have the insight what was actually applied and what not
[13:31:18] moritzm: i see, this is an arzel kind of problem
[13:32:23] yeah
[13:32:54] elukey: stat1005 has a current uptime of six minutes, was that a planned reboot or fallout from the PDU work?
[13:33:08] saw the alerts in Icinga
[13:34:12] nono my reboot, I was about to log it
[13:34:25] we had a problem with zombie processes after tensorflow tests
[13:34:50] done in SAL
[13:35:27] ack
[13:35:59] (PS12) Fdans: Add cassandra loading job for requests per file metric [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149)
[13:37:37] when I looked a few minutes ago, kern.log prior to the reboot was full of Translation Table Maps out-of-memory errors, dunno if that's the cause or fallout of the errors
[13:39:04] I didn't check in detail, Miriam said that it was due to some killing of tasks in tensorflow when they were training (there could be a bug in tf-rocm, I am going to upgrade the ROCm stuff to the latest)
[13:42:05] (CR) Nuria: "One question" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149) (owner: Fdans)
[13:46:52] (CR) Fdans: Add cassandra loading job for requests per file metric (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149) (owner: Fdans)
[13:55:31] (CR) Nuria: [C: -1] "Couple typos, if we have tested the job once corrected i think it is ready." (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149) (owner: Fdans)
[13:59:07] Dropping for kids
[13:59:16] (PS13) Fdans: Add cassandra loading job for requests per file metric [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149)
[13:59:48] (CR) Fdans: "Please hold until testing is complete (running load test now)" (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149) (owner: Fdans)
[14:26:45] Analytics: Port IRCRecentChanges to Kafka - https://phabricator.wikimedia.org/T232483 (elukey) p:Triage→Normal
[14:27:16] nuria: ---^
[14:27:18] Analytics, Code-Stewardship-Reviews, Operations, Tools, Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (elukey) There is now a task to track the work: T232483
[14:34:17] Analytics, Analytics-Kanban: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (elukey) Yes they use dedicated slaves only for backup purposes, to avoid locking as you mentioned. I will try to take a snapshot of the whole database to see the amount of time that it t...
[14:38:14] I am trying to take a snapshot of all the dbs on an-coord1001 as test
[14:38:21] if you see anything weird let me know
[14:43:02] Analytics, Analytics-Kanban: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (elukey) ` root@an-coord1001:/home/elukey# time mysqldump --all-databases > all-dbs-$(date +%s).sql real 2m19.774s user 1m7.260s sys 0m8.064s root@an-coord1001:/home/elukey# du -hs all-...
[14:45:24] elukey: this will be ok to merge before dns , ya?
[14:45:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/535209
[14:47:11] elukey: k
[14:47:34] ottomata: o/ - I'd say yes, but never done it.. In theory it shouldn't be a problem for the puppet master, it is only an indication about what role some hostnames should have
[14:48:05] Analytics, Analytics-Kanban: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (Ottomata) 2-3 mins is not bad! Hopefully each table is locked for much less time than that. I'd be willing to switch to this! We could keep the LVM snapshot for disaster recovery (and...
[14:48:10] aye ya
[14:52:33] elukey: i must be missing something big time cause i cannot execute the systemd timer code even in shell
[14:52:35] elukey:
[14:52:45] https://www.irccloud.com/pastebin/xgvc2Mci/
[14:52:58] elukey: any ideas?
[14:54:09] nuria: I think that when you sudo -u you lose the PYTHONPATH, I'd prepend it before the /srv/ bits
[14:54:33] Analytics, EventBus, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (hashar) p:Triage→Unbreak! Seems to be in #EventBus?
[14:55:42] elukey: right! thank you
[14:55:52] Analytics, EventBus, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (Reedy) https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/534642/
[14:56:01] Analytics, EventBus, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (hashar) Another: ` #0 /srv/mediawiki/php-1.34.0-wmf.22/includes/libs/http/MultiHttpClient.php(423): MWExceptionHandler::...
[14:58:17] hm elukey just read https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging, perhaps we need to leave the original mgmt entries in?
[14:58:18] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (mobrovac)
[14:58:32] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (Reedy) Ping {T232128}
[14:59:25] ottomata: ah snap sorry yes, I can re-add them
[14:59:33] i can no worried
[14:59:34] worries
[14:59:44] I didn't read that, just wanted to help :(
[14:59:50] you did help thank you!
[14:59:51] i just read it
[14:59:55] i should have read it before
[15:00:23] ah so both mgmt are needed
[15:00:25] old and new
[15:00:59] yeh i guess so
[15:08:07] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (mobrovac) It turns out `CURLMOPT_MA...
[15:20:24] (PS6) Mforns: [WIP] Add spark job to create mediawiki history dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612)
[15:26:29] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` ['an-p...
[15:34:44] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[15:34:48] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirtan1002.eqiad.wmnet'] ` Of which those **FA...
[15:36:05] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[15:39:45] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (Reedy) Open→Resolved a:Reedy
[15:40:04] elukey, I don't understand what you mean regarding checksum
[15:40:45] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (hashar) Fixed on production by reverting the faulty patch...
[15:42:17] mforns: check my last email
[15:42:27] k
[15:44:02] mforns: not sure if what I am saying makes sense, but my suspicion is that the drop script is not getting the parameters that we put in ExecStart
[15:44:13] I see..
[15:44:23] do we need the two $$ ?
[15:44:30] maybe I just pasted the wrong checksum, could be as well
[15:44:43] because before the change, the job was running fine with that no?
[15:44:45] Analytics, Analytics-Kanban, Operations, netops, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` cloudv...
[15:44:50] I have re-run it this morning without --execute, and it was good
[15:45:23] elukey, the double $$ is for puppet
[15:45:31] puppet interprets $$ as $
[15:45:33] in previous logs I can see
[15:45:33] 2019-09-10T00:00:01 INFO Unit tests passed.
[15:45:33] 2019-09-10T00:00:01 INFO Starting EXECUTION.
[15:45:35] no?
[15:45:40] aha
[15:46:01] elukey, so if you run it without --execute, it returns the same checksum?
[15:46:21] I tried this morning and IIRC it was the same
[15:46:27] I think that also nuria tried
[15:46:36] but if you want to triple check it is good
[15:46:48] k
[15:47:16] oh... I think I know, one sec
[15:47:46] also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/533955/2/modules/profile/manifests/analytics/refinery/job/data_purge.pp looks good, no typos etc..
[15:51:21] elukey, I think the checksum is wrong, because I should have executed the DRY-RUN in the command line with just 1 $
[15:51:27] I did with 2 $$
[15:51:48] that's why, after puppet replaces $$ with $, the checksum matches no more
[15:52:12] wow that is a trick I'll remember :) --^
[15:52:17] I'm amazed, though, how the regular expression with double $$ still works in the command line
[15:52:43] mforns: something is off, since if I do
[15:52:44] sudo systemctl cat drop-el-unsanitized-events | grep ExecStart
[15:52:48] I get the two $$
[15:52:54] ???
[15:53:10] puppet does not do anything with them, it just adds them as they are
[15:53:18] whaat
[15:53:31] but the script gets this
[15:53:32] '--tables': '^(?!wmdebanner)[A-Za-z0-9]+$',
[15:53:33] I googled that! :P
[15:53:44] so I think that systemd eats it
[15:53:49] well, then my theory is wrong
[15:56:09] mforns: I think that what happens is that the dry-run was done with $$, but in reality the script gets only $, so the checksum validation fails. I think that you are right, but it is systemd in this case (I think) that causes problems
[15:56:18] if this is true though I can't explain why it was working before
[15:56:19] (CR) jerkins-bot: [V: -1] [WIP] Add spark job to create mediawiki history dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/528504 (https://phabricator.wikimedia.org/T208612) (owner: Mforns)
[15:57:07] https://www.freedesktop.org/software/systemd/man/systemd.service.html
[15:57:18] To pass a literal dollar sign, use "$$".
[15:57:21] :P
[15:58:51] so mforns, if we want to keep the ExecStart we should leave the $$ in puppet, but run the dryrun with only one $
[15:58:57] and see if it runs fine
[16:01:05] aha, makes sense!
[16:01:21] ping mforns standup
[16:01:21] will execute the dry run with just 1 $ and update the patch
[16:01:31] coming!
[16:28:34] Analytics, Analytics-Kanban: wmf_netflow cube in Turnilo missing bytes and packets measures - https://phabricator.wikimedia.org/T232226 (ayounsi) A quick look shows everything looking fine.
[16:31:05] PROBLEM - yarn.wikimedia.org HTTPS on analytics-tool1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[16:37:45] had to reboot analytics-tool1001 :(
[16:37:53] RECOVERY - yarn.wikimedia.org HTTPS on analytics-tool1001 is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[16:40:58] Analytics, Analytics-Kanban, Operations, netops, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1002.eqiad.wmnet'] ` Of which those **FAI...
[16:45:46] Analytics, Analytics-Kanban, Operations, netops, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1001.eqiad.wmnet'] ` Of which those **FAI...
[17:07:32] mforns: there is a max char limit for the commit msg, jenkins is snarky :)
[17:07:51] elukey, sorry, I also forgot the comment we discussed, changing
[17:08:15] elukey, what should I do with the commit message, shorten the message or the prefix?
[17:12:44] mforns: maybe just say "bla bla: correct checksum" or similar
[17:12:48] should be enough
[17:12:55] k, done
[17:13:15] got a +2 from jenkins :]
[17:13:45] lovely, the comment is perfect
[17:13:46] merging!
[17:17:21] :D
[17:18:07] mforns: our dear script still fails for checksum
[17:18:18] :(((((((((((((((((((((((((((
[17:19:08] fdans: your last patch to aqs is just a rebase
[17:19:11] aaaaah! maaan, I forgot the skip trash :C
[17:19:23] fixing...
[17:19:26] fdans: do please take a look and let me know if you see anything different
[17:19:32] mforns: ah!
[17:19:44] >C
[17:19:45] elukey: i am going to ping arzel on maven ticket ok?
[17:20:09] elukey: the maven one that is
[17:20:24] elukey, it will take a while now...
[17:22:19] nuria: already pinged, he is working on it!
[17:22:25] even if it could take a bit
[17:22:26] elukey: oohhh ok
[17:22:34] he is going to update the task asap
[17:22:35] elukey: great, many thanks
[17:47:22] mforns: when the checksum is available lemme know and I'll live-hack-it on an-coord1001
[17:47:30] so we can test before puppet/merging/etc..
[17:47:35] elukey, it just finished the DRY RUN
[17:47:50] elukey, dc3b5e020579ae5516b7f372081d1fac
[17:50:08] mforns: working!
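The checksum mismatch diagnosed above can be sketched as follows. This is an illustration only: `args_checksum` is a hypothetical stand-in (MD5 over the sorted arguments), not the drop script's real checksum, and `systemd_unescape_dollars` models only the documented rule that "$$" in ExecStart yields a literal "$"; other systemd expansions are not modelled.

```python
import hashlib


def systemd_unescape_dollars(exec_start_arg: str) -> str:
    """Collapse "$$" to "$", mirroring only the ExecStart escape rule quoted above."""
    return exec_start_arg.replace("$$", "$")


def args_checksum(args: dict) -> str:
    """Hypothetical checksum: MD5 over the sorted 'name=value' argument pairs."""
    canonical = "\n".join(f"{k}={v}" for k, v in sorted(args.items()))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Value as written in the unit file (puppet leaves the "$$" untouched).
unit_file_tables = "^(?!wmdebanner)[A-Za-z0-9]+$$"

# Dry run from a shell, pasting the unit-file value verbatim in single quotes:
# the script sees both dollar signs.
dry_run_double = {"--tables": unit_file_tables, "--skip-trash": True}

# Run under systemd: the script sees a single dollar sign.
systemd_run = {
    "--tables": systemd_unescape_dollars(unit_file_tables),
    "--skip-trash": True,
}

# Dry run repeated with a single "$", as agreed in the channel.
dry_run_single = {"--tables": "^(?!wmdebanner)[A-Za-z0-9]+$", "--skip-trash": True}

print(args_checksum(dry_run_double) == args_checksum(systemd_run))
print(args_checksum(dry_run_single) == args_checksum(systemd_run))
```

Under these assumptions any change to the argument set, including adding or forgetting --skip-trash, also changes the checksum, which matches the second failure seen after the first fix.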
[17:50:10] \o/
[17:53:12] nuria: turns out that committing without staging any of the files doesn't send the changes to gerrit, who knew
[17:53:24] (PS5) Fdans: Add per file mediarequests endpoint to AQS [analytics/aqs] - https://gerrit.wikimedia.org/r/534824 (https://phabricator.wikimedia.org/T231589)
[17:57:53] elukey, :D
[17:57:59] will push
[17:59:23] RECOVERY - Check the last execution of drop-el-unsanitized-events on an-coord1001 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:59:53] this was me manually hacking and restarting --^
[18:09:32] mforns: merged!
[18:09:39] cool!
[18:09:56] we should see now #files decrease slowly
[18:16:26] * elukey dinner! o/
[18:38:20] Analytics, Language-analytics, Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Language - https://phabricator.wikimedia.org/T226856 (kzimmerman)
[18:45:06] Analytics, Analytics-Kanban, Product-Analytics: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382 (JAllemandou) Doc updated here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Usage_notes
[18:50:26] Analytics, Performance-Team, Product-Analytics, Research-Backlog: Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (Krinkle)
[18:51:21] Analytics, Better Use Of Data, Performance-Team, Product-Analytics, Research-Backlog: Switch mw.user.sessionId back to session-cookie persistence - https://phabricator.wikimedia.org/T223931 (jlinehan)
[18:54:03] Analytics, EventBus, Core Platform Team (Needs Cleaning - Code Health (TEC13)): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (WDoranWMF) Open→Declined
[18:54:11] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 3 others: Modern Event Platform: Stream Intake Service (EventGate): Implementation - https://phabricator.wikimedia.org/T206785 (WDoranWMF)
[18:54:30] Analytics, EventBus, Core Platform Team (Needs Cleaning - Code Health (TEC13)): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (Ottomata) Hm! Why declined?
[18:57:26] !log Manually fixed dewiki wikitext for snapshot=2019-07 (snapshot is now full and complete despite oozie error)
[18:57:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:58:41] Analytics, Analytics-Kanban: mediawiki-history-wikitext-coord job fails every month - https://phabricator.wikimedia.org/T228883 (JAllemandou) Snapshot 2019-07 manually fixed using a manually uncompressed version of the problematic file. Leaving this task open in pause for possible future similar problem.
[18:59:28] Analytics, Analytics-Kanban: mediawiki-history-wikitext-coord job fails every month - https://phabricator.wikimedia.org/T228883 (JAllemandou)
[18:59:43] Analytics, Analytics-Kanban, Product-Analytics: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382 (JAllemandou)
[18:59:58] Analytics, Analytics-Kanban, Patch-For-Review: Cleanup refinery artifacts folder from unneeded jars - https://phabricator.wikimedia.org/T231856 (JAllemandou)
[19:01:47] Analytics, EventBus, Core Platform Team (Needs Cleaning - Code Health (TEC13)): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (Pchelolo) @Ottomata The newer node-rdkafka seem to already have a bunch of the code we wanted t...
[19:02:27] Analytics, EventBus, Core Platform Team (Needs Cleaning - Code Health (TEC13)): Factor lib/kafka.js out of eventgate and change-propagation into its own library - https://phabricator.wikimedia.org/T220725 (Ottomata) oh cool!
[19:10:41] (CR) Nuria: [C: -1] "Nice, much better this way. Couple nits but one comment about file names and url encoding that seems significant. File names into cassandr" (5 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/534824 (https://phabricator.wikimedia.org/T231589) (owner: Fdans)
[19:21:48] You know you're actually not up to the level when you don't know any of the acronyms used in an answer - T232456
[19:25:15] joal: i know!
[19:25:43] joal: let's switch the neutron flow of the hadron collider
[19:25:50] :D
[19:41:53] Analytics, Analytics-Kanban, Product-Analytics: Discrepancies in Superset Pageview Data - https://phabricator.wikimedia.org/T232382 (Nuria) Open→Resolved
[19:42:33] Gone for tonight team - Tomorrow is kids day, I'll be on at siesta and in the evening :)
[19:44:45] Analytics: Superset throwing up performance errors - https://phabricator.wikimedia.org/T231614 (Nuria) Closing, rather than a bug is a performance limitation, data is too large to be split by browser family for all families for 4 years. Filtering browsing families or limiting by timespan of pageviews should...
[19:45:04] Analytics: Superset throwing up performance errors - https://phabricator.wikimedia.org/T231614 (Nuria) Open→Resolved
[19:50:59] Analytics, Analytics-Kanban: Sqoop: remove cuc_comment and join to comment table - https://phabricator.wikimedia.org/T217848 (Nuria) cu_changes is always sqooped from production right?
[19:53:33] (CR) Nuria: [C: +1] "Is there anything preventing us to merge these changes?" [analytics/refinery] - https://gerrit.wikimedia.org/r/534611 (https://phabricator.wikimedia.org/T231856) (owner: Joal)
[20:13:44] joal, ottomata : i think that there are still analytics jobs running on default queue that should probably be moved to production queue
[20:16:35] ottomata: are we deprecating camus-eventbus?
[20:17:58] nuria: we should probably refactor it a little bit
[20:18:01] rename it
[20:18:05] but it is the one importing the mediawiki events
[20:18:26] ottomata: k, can you answer arzel on this ticket: https://phabricator.wikimedia.org/T232456
[20:18:50] ottomata: i think host is the archiva one but I am not sure if I understand the question
[20:20:11] i think that's right nuria
[20:32:59] ottomata: i moved two camus jobs to production queue as they were on default queue
[20:33:44] oh interesting
[20:33:45] thanks
[20:33:47] which ones nuria ?
[20:33:54] https://www.irccloud.com/pastebin/LhIfeoJd/
[20:36:09] Analytics, CheckUser, Core Platform Team: Refactor Comment fields for CheckUser Component - https://phabricator.wikimedia.org/T232531 (WDoranWMF)
[20:37:14] Analytics, Analytics-Kanban: Sqoop: remove cuc_comment and join to comment table - https://phabricator.wikimedia.org/T217848 (Nuria) The task to follow is https://phabricator.wikimedia.org/T232531 that already ccs analytics, that refactor has not started yet
[20:37:53] Analytics, CheckUser, Core Platform Team: Refactor Comment fields for CheckUser Component - https://phabricator.wikimedia.org/T232531 (WDoranWMF)
[21:10:45] Analytics: Sqoop: remove cuc_comment and join to comment table - https://phabricator.wikimedia.org/T217848 (Nuria)
[21:34:35] !log restarting archiva service on archiva1001
[21:34:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:01:41] PROBLEM - yarn.wikimedia.org HTTPS on analytics-tool1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[22:03:43] PROBLEM - Hue CherryPy python server on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration
[22:13:19] RECOVERY - yarn.wikimedia.org HTTPS on analytics-tool1001 is OK: HTTP OK: HTTP/1.1 200 OK - 248 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster
[22:13:43] ok, could not ssh but hey, looks like it is fixed now
[22:29:55] RECOVERY - Hue CherryPy python server on analytics-tool1001 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue runcherrypyserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration
[23:23:09] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (Nuria) I just thought i can easily setup turnilo to decode tcp_flags so they are not ints, let me give it a try
[23:32:17] Analytics, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team), MW-1.34-notes (1.34.0-wmf.23; 2019-09-17), Wikimedia-production-error: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 (aaron) Odd...