Forget me not: Machine learning at Komorebi
Alberto Torres Barrán (albertotb@gmail.com)
Jekyll feed, 2022-07-21T08:43:30+00:00, https://albertotb.github.io/feed.xml

Git prompt with conda and conda-auto-env
2020-02-21, https://albertotb.github.io/Git-prompt-with-conda-and-conda-auto-env
<p>The Git Team maintains a bash script that sets a message in your prompt displaying the current branch and status. The script can be found <a href="https://github.com/git/git/tree/master/contrib/completion">here</a>. To install the script, I have modified the instructions from this <a href="https://digitalfortress.tech/tutorial/setting-up-git-prompt-step-by-step/">tutorial</a> to make it work with conda and the conda-auto-env tool.</p>
<p>First, we are going to assume that conda is installed on our system, and we will add conda-auto-env. This tool automatically activates the conda environment every time we <code class="language-plaintext highlighter-rouge">cd</code> into a folder that has an <code class="language-plaintext highlighter-rouge">env.yml</code> or <code class="language-plaintext highlighter-rouge">environment.yml</code> file. There are many versions, but in this post I will use the one from <a href="https://janosh.io/blog/conda-auto-env/">here</a>, modified for bash. Download the file <a href="https://raw.githubusercontent.com/albertotb/git-conda-prompt/master/conda_auto_env.sh"><code class="language-plaintext highlighter-rouge">conda_auto_env.sh</code></a> to any location in your home folder, for instance <code class="language-plaintext highlighter-rouge">~/scripts</code>. Then download the script <a href="https://raw.githubusercontent.com/git/git/master/contrib/completion/git-prompt.sh"><code class="language-plaintext highlighter-rouge">git-prompt.sh</code></a> to the same location and add the following to the <strong>end</strong> of your <code class="language-plaintext highlighter-rouge">~/.bashrc</code> file:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">GREEN</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[38;5;155m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">DARK_GREEN</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[00;32m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">GRAY</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[38;5;8m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">ORANGE</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[38;5;220m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">BLUE</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[38;5;117m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">WHITE</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[38;5;15m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">YELLOW</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[01;33m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">LIGHT_GRAY</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[0;37m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">CYAN</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[0;36m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">RED</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[0;31m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">VIOLET</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[01;35m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">MAGENTA</span><span class="o">=</span><span class="s2">"</span><span class="se">\[\0</span><span class="s2">33[0;35m</span><span class="se">\]</span><span class="s2">"</span>
<span class="nv">RESET</span><span class="o">=</span><span class="s2">"</span><span class="se">\[</span><span class="si">$(</span>tput sgr0<span class="si">)</span><span class="se">\]</span><span class="s2">"</span>
<span class="k">function </span>git_and_conda_prompt <span class="o">{</span>
<span class="nb">local </span><span class="nv">__git_branch_color</span><span class="o">=</span><span class="s2">"</span><span class="nv">$DARK_GREEN</span><span class="s2">"</span>
<span class="nb">local </span><span class="nv">__git_branch</span><span class="o">=</span><span class="si">$(</span>__git_ps1 <span class="s1">' [%s]'</span><span class="si">)</span><span class="p">;</span>
<span class="c"># colour branch name depending on state
</span>
<span class="k">if</span> <span class="o">[[</span> <span class="s2">"</span><span class="k">${</span><span class="nv">__git_branch</span><span class="k">}</span><span class="s2">"</span> <span class="o">=</span>~ <span class="s2">"*"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span> <span class="c"># if repository is dirty
</span>
<span class="nv">__git_branch_color</span><span class="o">=</span><span class="s2">"</span><span class="nv">$RED</span><span class="s2">"</span>
<span class="k">elif</span> <span class="o">[[</span> <span class="s2">"</span><span class="k">${</span><span class="nv">__git_branch</span><span class="k">}</span><span class="s2">"</span> <span class="o">=</span>~ <span class="s2">"$"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span> <span class="c"># if there is something stashed
</span>
<span class="nv">__git_branch_color</span><span class="o">=</span><span class="s2">"</span><span class="nv">$YELLOW</span><span class="s2">"</span>
<span class="k">elif</span> <span class="o">[[</span> <span class="s2">"</span><span class="k">${</span><span class="nv">__git_branch</span><span class="k">}</span><span class="s2">"</span> <span class="o">=</span>~ <span class="s2">"%"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span> <span class="c"># if there are only untracked files
</span>
<span class="nv">__git_branch_color</span><span class="o">=</span><span class="s2">"</span><span class="nv">$LIGHT_GRAY</span><span class="s2">"</span>
<span class="k">elif</span> <span class="o">[[</span> <span class="s2">"</span><span class="k">${</span><span class="nv">__git_branch</span><span class="k">}</span><span class="s2">"</span> <span class="o">=</span>~ <span class="s2">"+"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span> <span class="c"># if there are staged files
</span>
<span class="nv">__git_branch_color</span><span class="o">=</span><span class="s2">"</span><span class="nv">$CYAN</span><span class="s2">"</span>
<span class="k">fi
</span><span class="nv">PS1</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">CONDA_PROMPT_MODIFIER</span><span class="k">}${</span><span class="nv">GREEN</span><span class="k">}</span><span class="se">\u</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}${</span><span class="nv">GRAY</span><span class="k">}</span><span class="s2">@</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}${</span><span class="nv">ORANGE</span><span class="k">}</span><span class="se">\h</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}${</span><span class="nv">GRAY</span><span class="k">}</span><span class="s2">:</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}${</span><span class="nv">BLUE</span><span class="k">}</span><span class="se">\w</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}</span><span class="nv">$__git_branch_color$__git_branch</span><span class="k">${</span><span class="nv">GRAY</span><span class="k">}</span><span class="se">\$</span><span class="k">${</span><span class="nv">RESET</span><span class="k">}${</span><span class="nv">WHITE</span><span class="k">}${</span><span class="nv">RESET</span><span class="k">}</span><span class="s2"> "</span>
<span class="o">}</span>
<span class="nb">export </span><span class="nv">PROMPT_COMMAND</span><span class="o">=</span><span class="s2">"conda_auto_env;git_and_conda_prompt"</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> ~/scripts/conda_auto_env.sh <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">source</span> ~/scripts/conda_auto_env.sh
<span class="k">fi</span>
<span class="c"># if git-prompt.sh exists, set options and source it
</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> ~/scripts/git-prompt.sh <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nv">GIT_PS1_SHOWDIRTYSTATE</span><span class="o">=</span><span class="nb">true
</span><span class="nv">GIT_PS1_SHOWSTASHSTATE</span><span class="o">=</span><span class="nb">true
</span><span class="nv">GIT_PS1_SHOWUNTRACKEDFILES</span><span class="o">=</span><span class="nb">true
</span><span class="nv">GIT_PS1_SHOWUPSTREAM</span><span class="o">=</span><span class="s2">"auto"</span>
<span class="nv">GIT_PS1_HIDE_IF_PWD_IGNORED</span><span class="o">=</span><span class="nb">true
</span><span class="nv">GIT_PS1_SHOWCOLORHINTS</span><span class="o">=</span><span class="nb">true
source</span> ~/scripts/git-prompt.sh
<span class="k">fi</span></code></pre></figure>
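<p>The two downloads described before the snippet can also be scripted. The following is a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">curl</code> is available; the function name <code class="language-plaintext highlighter-rouge">download_prompt_scripts</code> and the variables <code class="language-plaintext highlighter-rouge">SCRIPTS_DIR</code>, <code class="language-plaintext highlighter-rouge">CONDA_AUTO_ENV_URL</code> and <code class="language-plaintext highlighter-rouge">GIT_PROMPT_URL</code> are made up for illustration, while the URLs and the <code class="language-plaintext highlighter-rouge">~/scripts</code> location are the ones used in this post.</p>

```shell
# Hypothetical helper (not part of the original setup): fetch both scripts
# into the directory that the ~/.bashrc snippet sources them from.
SCRIPTS_DIR="${SCRIPTS_DIR:-$HOME/scripts}"
CONDA_AUTO_ENV_URL="https://raw.githubusercontent.com/albertotb/git-conda-prompt/master/conda_auto_env.sh"
GIT_PROMPT_URL="https://raw.githubusercontent.com/git/git/master/contrib/completion/git-prompt.sh"

download_prompt_scripts() {
    # Create the target directory if it does not exist yet
    mkdir -p "$SCRIPTS_DIR" || return 1
    # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
    curl -fsSL -o "$SCRIPTS_DIR/conda_auto_env.sh" "$CONDA_AUTO_ENV_URL" &&
    curl -fsSL -o "$SCRIPTS_DIR/git-prompt.sh" "$GIT_PROMPT_URL"
}
```

<p>Running <code class="language-plaintext highlighter-rouge">download_prompt_scripts</code> once and then opening a new terminal should be enough for the <code class="language-plaintext highlighter-rouge">~/.bashrc</code> snippet to find both files.</p>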
<p>Let us explain what the <code class="language-plaintext highlighter-rouge">~/.bashrc</code> snippet above does:</p>
<ol>
<li>Define the colors used in the prompt. This makes the <code class="language-plaintext highlighter-rouge">PS1</code> variable easier to modify.</li>
<li>Define a function <code class="language-plaintext highlighter-rouge">git_and_conda_prompt</code>. We get the git status from <code class="language-plaintext highlighter-rouge">__git_ps1</code> (provided by <code class="language-plaintext highlighter-rouge">git-prompt.sh</code>) and color the branch name according to the first status marker that matches:
<ul>
<li><code class="language-plaintext highlighter-rouge">*</code> unstaged changes</li>
<li><code class="language-plaintext highlighter-rouge">$</code> stashed changes</li>
<li><code class="language-plaintext highlighter-rouge">%</code> untracked files</li>
<li><code class="language-plaintext highlighter-rouge">+</code> staged changes</li>
</ul>
</li>
<li>Build the <code class="language-plaintext highlighter-rouge">PS1</code> variable. This line can be customized by changing the information shown and the colors:
<ul>
<li><code class="language-plaintext highlighter-rouge">$CONDA_PROMPT_MODIFIER</code>: current conda environment</li>
<li><code class="language-plaintext highlighter-rouge">\u</code>: username</li>
<li><code class="language-plaintext highlighter-rouge">\h</code>: hostname</li>
<li><code class="language-plaintext highlighter-rouge">\w</code>: working directory</li>
</ul>
</li>
<li>Add <code class="language-plaintext highlighter-rouge">conda_auto_env</code> (provided by <code class="language-plaintext highlighter-rouge">conda_auto_env.sh</code>) and <code class="language-plaintext highlighter-rouge">git_and_conda_prompt</code> to the <code class="language-plaintext highlighter-rouge">$PROMPT_COMMAND</code> variable. Bash executes the contents of this variable as a command just before displaying each prompt. The order is important, since we want to activate the environment first (if any) and then build the prompt with all the information.</li>
</ol>
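<p>The color choice in step 2 can be tried without a git repository. This is a rough sketch, not the function from the snippet: it uses a <code class="language-plaintext highlighter-rouge">case</code> statement instead of <code class="language-plaintext highlighter-rouge">=~</code>, prints a color name rather than an escape sequence, and <code class="language-plaintext highlighter-rouge">branch_state_color</code> is a hypothetical name. The order of the checks is the same as in <code class="language-plaintext highlighter-rouge">git_and_conda_prompt</code>.</p>

```shell
# Hypothetical helper: map a __git_ps1-style string to a color name,
# checking the status markers in the same order as git_and_conda_prompt.
branch_state_color() {
    case "$1" in
        *'*'*) echo "red" ;;         # unstaged changes
        *'$'*) echo "yellow" ;;      # something stashed
        *'%'*) echo "light_gray" ;;  # untracked files
        *'+'*) echo "cyan" ;;        # staged changes
        *)     echo "dark_green" ;;  # none of the markers above
    esac
}

branch_state_color ' [master *+]'   # prints: red
branch_state_color ' [master $]'    # prints: yellow
branch_state_color ' [master]'      # prints: dark_green
```

<p>Because the first matching pattern wins, a branch with both unstaged and staged changes is shown in red, just as in the original <code class="language-plaintext highlighter-rouge">if</code>/<code class="language-plaintext highlighter-rouge">elif</code> chain.</p>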
<p>The variable <a href="https://github.com/conda/conda/issues/1070"><code class="language-plaintext highlighter-rouge">$CONDA_PROMPT_MODIFIER</code></a> is set by <code class="language-plaintext highlighter-rouge">conda activate</code> and contains the name of the current environment between <code class="language-plaintext highlighter-rouge">()</code>. <code class="language-plaintext highlighter-rouge">conda init</code> already shows this information in the prompt by setting the <code class="language-plaintext highlighter-rouge">PS1</code> variable; however, we have to add it manually too, since we are overriding <code class="language-plaintext highlighter-rouge">PS1</code> in the function <code class="language-plaintext highlighter-rouge">git_and_conda_prompt</code>.</p>

Benchmark adding together multiple columns in dplyr
2019-02-05, https://albertotb.github.io/Benchmark-adding-together-multiple-columns-in-dplyr
<p>Inspired partly by <a href="https://stackoverflow.com/questions/47759347/create-a-new-column-which-is-the-sum-of-specific-columns-selected-by-their-name/">this</a> and <a href="https://stackoverflow.com/questions/28873057/sum-across-multiple-columns-with-dplyr/">this</a> Stack Overflow question, I wanted to test which is the fastest way to create a new column with <code class="language-plaintext highlighter-rouge">dplyr</code> as a combination of other columns.</p>
<p>First, let’s create some example data:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">microbenchmark</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000000</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">6</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="o">*</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w">
</span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">str_pad</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">str_length</span><span class="p">(</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">pad</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"0"</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">,</span><span class="w"> </span><span class="s2">"C"</span><span class="p">,</span><span class="w"> </span><span class="s2">"D"</span><span class="p">,</span><span class="w"> </span><span class="s2">"E"</span><span class="p">,</span><span class="w"> </span><span class="s2">"F"</span><span class="p">)))</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">rownames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'index'</span><span class="p">)</span></code></pre></figure>
<p>We have a data frame with 6 binary columns, and we want to create another one which is the sum of these columns. The most straightforward way is to use <code class="language-plaintext highlighter-rouge">mutate()</code> directly:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mutate</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">C</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">D</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 1,000,000 x 8
## index A B C D E F total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 00001 0 0 0 1 1 0 2
## 2 00002 1 0 0 0 0 1 2
## 3 00003 1 1 0 1 0 1 4
## 4 00004 1 0 0 1 0 0 2
## 5 00005 1 0 0 1 1 0 3
## 6 00006 1 1 1 0 0 1 4
## 7 00007 0 1 0 0 1 1 3
## 8 00008 0 0 0 0 1 0 1
## 9 00009 1 0 1 1 1 1 5
## 10 00010 1 1 0 0 0 0 2
## # ... with 999,990 more rows</code></pre></figure>
<p>This is probably going to be very fast, since it takes full advantage of R's vectorized operations. The downside is that if we want to sum up, say, 20 columns, we have to write out the names of all of them.</p>
<p>The second approach is to use tidy data principles to transform the previous data frame into long form and then perform the operation by group:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">value</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 1,000,000 x 2
## index total
## <chr> <dbl>
## 1 00001 2
## 2 00002 2
## 3 00003 4
## 4 00004 2
## 5 00005 3
## 6 00006 4
## 7 00007 3
## 8 00008 1
## 9 00009 5
## 10 00010 2
## # ... with 999,990 more rows</code></pre></figure>
<p>The downside of this approach is that we have as many groups as rows in the original data frame, and grouped operations are usually inefficient when the number of groups is very large. Of course, depending on the meaning of the columns “A”, “B”, etc., the data frame <code class="language-plaintext highlighter-rouge">df</code> may not be a tidy dataset, in which case reshaping it following tidy data principles is a good idea anyway. However, it may also already be in tidy form.</p>
<p>The next possibility is to iterate over the rows of the original data, summing them up. Here we can use the functions <code class="language-plaintext highlighter-rouge">apply()</code> or <code class="language-plaintext highlighter-rouge">rowSums()</code> from base R and <code class="language-plaintext highlighter-rouge">pmap()</code> from the <code class="language-plaintext highlighter-rouge">purrr</code> package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mutate</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 1,000,000 x 8
## index A B C D E F total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 00001 0 0 0 1 1 0 2
## 2 00002 1 0 0 0 0 1 2
## 3 00003 1 1 0 1 0 1 4
## 4 00004 1 0 0 1 0 0 2
## 5 00005 1 0 0 1 1 0 3
## 6 00006 1 1 1 0 0 1 4
## 7 00007 0 1 0 0 1 1 3
## 8 00008 0 0 0 0 1 0 1
## 9 00009 1 0 1 1 1 1 5
## 10 00010 1 1 0 0 0 0 2
## # ... with 999,990 more rows</code></pre></figure>
<p>These functions perform the same operation but differ in many aspects:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">apply()</code> coerces the data frame into a matrix, so care needs to be taken with non-numeric columns.</li>
<li><code class="language-plaintext highlighter-rouge">rowSums()</code> can only be used if we want to perform the sum or the mean (<code class="language-plaintext highlighter-rouge">rowMeans()</code>), but not for other operations.</li>
<li><code class="language-plaintext highlighter-rouge">pmap()</code> has variants that let you specify the type of the output (<code class="language-plaintext highlighter-rouge">pmap_dbl()</code>, <code class="language-plaintext highlighter-rouge">pmap_lgl()</code>) and thus are safer. If the output cannot be coerced to the given type, an error is raised.</li>
</ul>
<p>Finally, we have the <code class="language-plaintext highlighter-rouge">reduce()</code> function from the <code class="language-plaintext highlighter-rouge">purrr</code> package (see <a href="https://adv-r.hadley.nz/functionals.html#reduce">this</a> chapter from “Advanced R” by Hadley Wickham to learn more). This function lets us take full advantage of R's vectorized operations and write the operation very concisely, whether it be for 6 or 20 columns.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mutate</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reduce</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">),</span><span class="w"> </span><span class="n">`+`</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 1,000,000 x 8
## index A B C D E F total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 00001 0 0 0 1 1 0 2
## 2 00002 1 0 0 0 0 1 2
## 3 00003 1 1 0 1 0 1 4
## 4 00004 1 0 0 1 0 0 2
## 5 00005 1 0 0 1 1 0 3
## 6 00006 1 1 1 0 0 1 4
## 7 00007 0 1 0 0 1 1 3
## 8 00008 0 0 0 0 1 0 1
## 9 00009 1 0 1 1 1 1 5
## 10 00010 1 1 0 0 0 0 2
## # ... with 999,990 more rows</code></pre></figure>
<p>We can measure the running time of every snippet of code using the package <code class="language-plaintext highlighter-rouge">microbenchmark</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_equal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">values</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">values</span><span class="p">[</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">all_equal</span><span class="p">(</span><span class="n">values</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">x</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">microbenchmark</span><span class="p">(</span><span class="w">
</span><span class="s2">"vectorized"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">C</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">D</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">index</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"gather"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">index</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">value</span><span class="p">))</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"pmap"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pmap_dbl</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">),</span><span class="w"> </span><span class="n">sum</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">index</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"rowSums"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">index</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"apply"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">index</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"reduce"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reduce</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">index</span><span class="p">),</span><span class="w"> </span><span class="n">`+`</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">index</span><span class="p">,</span><span class="w"> </span><span class="n">total</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="n">check</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">check_equal</span><span class="p">,</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">print</span><span class="p">(</span><span class="n">bm</span><span class="p">,</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'median'</span><span class="p">,</span><span class="w"> </span><span class="n">signif</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Unit: milliseconds
## expr min lq mean median uq max neval cld
## vectorized 8.52 8.77 10.2 10.2 11.2 12.9 10 a
## reduce 11.80 12.30 20.7 16.5 18.2 64.3 10 a
## rowSums 35.70 38.30 46.1 42.3 42.8 90.9 10 a
## apply 1520.00 1740.00 1850.0 1800.0 2020.0 2360.0 10 b
## pmap 4770.00 5010.00 5230.0 5200.0 5410.0 5810.0 10 c
## gather 12800.00 13100.00 14000.0 13600.0 14300.0 17200.0 10 d</code></pre></figure>
<p>The results are mostly as expected. The vectorized code is the fastest, but it is not very concise. The <code class="language-plaintext highlighter-rouge">reduce()</code> function is also very fast, and can be used with any number of columns. The slowest is the <code class="language-plaintext highlighter-rouge">gather()</code> approach, and it should probably be avoided unless you already need to tidy your data.</p>
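<p>As a minimal sketch of why <code class="language-plaintext highlighter-rouge">reduce()</code> scales to any number of columns: it folds a binary, vectorized function over the columns, so swapping <code class="language-plaintext highlighter-rouge">`+`</code> for another vectorized function such as <code class="language-plaintext highlighter-rouge">pmax()</code> gives a different row-wise summary. The tibble <code class="language-plaintext highlighter-rouge">small</code> below is made up for illustration and is not part of the benchmark:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(purrr)
library(tibble)

# Toy data, only for illustration
small <- tibble(index = c("1", "2", "3"),
                A = c(0, 1, 1),
                B = c(1, 1, 0),
                C = c(0, 1, 0))

cols <- select(small, -index)

# reduce() walks over the columns: (A + B) + C
row_sum <- reduce(cols, `+`)

# Same pattern with another vectorized function: pmax(pmax(A, B), C)
row_max <- reduce(cols, pmax)</code></pre></figure>
<p>Each step operates on whole columns, so the number of function calls grows with the number of columns rather than with the number of rows.</p>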
<p>Two things were really surprising:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">rowSums()</code> is much faster than <code class="language-plaintext highlighter-rouge">apply()</code> and almost as good as <code class="language-plaintext highlighter-rouge">reduce()</code>. As mentioned before, it can only be used when computing the sum or the mean.</li>
<li><code class="language-plaintext highlighter-rouge">apply()</code> is almost three times as fast as <code class="language-plaintext highlighter-rouge">pmap_dbl()</code> (median of 1800 ms versus 5200 ms), probably because of the extra checks needed by <code class="language-plaintext highlighter-rouge">pmap()</code>. However, I would have expected them to be much closer.</li>
</ul>
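<p>One plausible reason for the first gap, offered as an assumption rather than a profiled result, is that <code class="language-plaintext highlighter-rouge">apply()</code> calls <code class="language-plaintext highlighter-rouge">sum()</code> once per row from R, while <code class="language-plaintext highlighter-rouge">rowSums()</code> runs the whole loop in compiled code. The sketch below only checks that both give the same totals on a made-up matrix and shows how the two could be timed; the timings will vary by machine:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy matrix with 3 rows and 2 columns (filled column by column)
m <- matrix(c(0, 1, 1,
              1, 1, 0), nrow = 3)

via_apply   <- apply(m, 1, sum)  # one R-level call to sum() per row
via_rowsums <- rowSums(m)        # a single call for the whole matrix

stopifnot(all.equal(via_apply, via_rowsums))

# On a larger matrix the difference in running time becomes visible
big <- matrix(runif(1e6), ncol = 10)
system.time(apply(big, 1, sum))
system.time(rowSums(big))</code></pre></figure>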
<p>We end this post with a violin plot of the results:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">autoplot</span><span class="p">(</span><span class="n">bm</span><span class="p">)</span></code></pre></figure>
<p><img src="../assets/images/unnamed-chunk-20-1.png" alt="plot of chunk unnamed-chunk-20" /></p>Alberto Torres Barránalbertotb@gmail.com
Compute correlations using the tidyverse2019-01-28T00:00:00+00:002019-01-28T00:00:00+00:00https://albertotb.github.io/Compute-correlations-using-the-tidyverse<p>This small example aims to provide some use cases for the <code class="language-plaintext highlighter-rouge">tidyr</code> package. Let’s generate some example data first:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">forcats</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">sales</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ymd</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">20180101</span><span class="p">,</span><span class="w"> </span><span class="m">20180102</span><span class="p">,</span><span class="w"> </span><span class="m">20180103</span><span class="p">),</span><span class="w"> </span><span class="m">3</span><span class="p">)),</span><span class="w">
</span><span class="n">product</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">,</span><span class="w"> </span><span class="s2">"C"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w">
</span><span class="n">sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">sales</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 9 x 3
## date product sales
## <date> <chr> <int>
## 1 2018-01-01 A 3
## 2 2018-01-02 A 13
## 3 2018-01-03 A 13
## 4 2018-01-01 B 13
## 5 2018-01-02 B 18
## 6 2018-01-03 B 13
## 7 2018-01-01 C 1
## 8 2018-01-02 C 5
## 9 2018-01-03 C 14</code></pre></figure>
<p>We want to compute the correlation of the sales from products A, B and C. The base R function <code class="language-plaintext highlighter-rouge">cor()</code> takes a matrix or data.frame and computes the correlation between all the column pairs. Thus, first we need to convert the data.frame <code class="language-plaintext highlighter-rouge">sales</code>, which is in long form, to wide form with one column per product.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cor_matrix</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">sales</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">product</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sales</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">cor</span><span class="p">()</span><span class="w">
</span><span class="n">cor_matrix</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## A B C
## A 1.0000000 0.5000000 0.7370435
## B 0.5000000 1.0000000 -0.2167775
## C 0.7370435 -0.2167775 1.0000000</code></pre></figure>
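<p>In tidyr 1.0.0 and later, <code class="language-plaintext highlighter-rouge">spread()</code> is superseded by <code class="language-plaintext highlighter-rouge">pivot_wider()</code>. A sketch of the same reshaping with the newer function follows. The sales values are typed in by hand from the printout above so that the example does not depend on the random seed, and the names <code class="language-plaintext highlighter-rouge">sales_fixed</code> and <code class="language-plaintext highlighter-rouge">cor_wide</code> are only illustrative:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(tibble)
library(dplyr)
library(tidyr)

# Same data as above, with the sampled values written out explicitly
sales_fixed <- tibble(date = as.Date(rep(c("2018-01-01", "2018-01-02", "2018-01-03"), 3)),
                      product = rep(c("A", "B", "C"), each = 3),
                      sales = c(3, 13, 13, 13, 18, 13, 1, 5, 14))

cor_wide <-
  sales_fixed %>%
  # one column per product, one row per date
  pivot_wider(names_from = product, values_from = sales) %>%
  select(-date) %>%
  cor()

cor_wide</code></pre></figure>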
<p>To manipulate the correlation matrix using <code class="language-plaintext highlighter-rouge">tidyverse</code>-related functions we need to convert the previous matrix back to a long data.frame:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cor_tidy</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">cor_matrix</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rownames_to_column</span><span class="p">(</span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"product1"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">product2</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">product1</span><span class="p">)</span><span class="w">
</span><span class="n">cor_tidy</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## product1 product2 corr
## 1 A A 1.0000000
## 2 B A 0.5000000
## 3 C A 0.7370435
## 4 A B 0.5000000
## 5 B B 1.0000000
## 6 C B -0.2167775
## 7 A C 0.7370435
## 8 B C -0.2167775
## 9 C C 1.0000000</code></pre></figure>
<p>Now we can plot the correlation matrix using ggplot2, for instance with a heatmap:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">cor_tidy</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">product1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">product2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_tile</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_gradient2</span><span class="p">(</span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span></code></pre></figure>
<p><img src="../assets/images/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" width="80%" /></p>
<p>Another common way of representing correlations is a bar plot. For this type of plot we usually want to drop the diagonal and keep only one of the two triangles (upper or lower), since the matrix is symmetric, and sort the bars from lowest to highest correlation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cor_tidy</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">product1</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">product2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">distinct</span><span class="p">(</span><span class="n">products</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">pmin</span><span class="p">(</span><span class="n">product1</span><span class="p">,</span><span class="w"> </span><span class="n">product2</span><span class="p">),</span><span class="w">
</span><span class="n">pmax</span><span class="p">(</span><span class="n">product1</span><span class="p">,</span><span class="w"> </span><span class="n">product2</span><span class="p">),</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"_vs_"</span><span class="p">),</span><span class="w"> </span><span class="n">.keep_all</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_reorder</span><span class="p">(</span><span class="n">products</span><span class="p">,</span><span class="w"> </span><span class="n">corr</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.7</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_col</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.7</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylim</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"products"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">aspect.ratio</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Cor > 0.7"</span><span class="p">,</span><span class="w">
</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Yes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"No"</span><span class="p">))</span></code></pre></figure>
<p><img src="../assets/images/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" width="80%" /></p>
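<p>The <code class="language-plaintext highlighter-rouge">distinct()</code> call in the previous chunk builds an order-independent key for each pair: <code class="language-plaintext highlighter-rouge">pmin()</code> and <code class="language-plaintext highlighter-rouge">pmax()</code> put the two product names in alphabetical order before pasting them, so “A, B” and “B, A” map to the same string. A small sketch on a made-up data frame (<code class="language-plaintext highlighter-rouge">pairs_df</code> and its values are illustrative):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(dplyr)
library(tibble)

# Toy data with each pair listed in both orders
pairs_df <- tibble(product1 = c("A", "B", "A", "C"),
                   product2 = c("B", "A", "C", "A"),
                   corr     = c(0.5, 0.5, 0.74, 0.74))

# Element-wise min and max of the two names give a key that ignores order
keyed <- pairs_df %>%
  mutate(products = paste(pmin(product1, product2),
                          pmax(product1, product2), sep = "_vs_"))

# Keep the first row for every key
unique_pairs <- distinct(keyed, products, .keep_all = TRUE)
unique_pairs</code></pre></figure>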
<p>The code for the bar plot uses a neat trick to drop rows whose pair of product IDs has already appeared, regardless of order (see <a href="https://stackoverflow.com/questions/38687545/r-select-first-dataframe-row-for-each-unique-pair-ignoring-order">this</a> and <a href="https://stackoverflow.com/questions/28574006/unique-rows-considering-two-columns-in-r-without-order">this</a> answers from Stack Overflow). The trick can be generalized to more than two columns, although it is not trivial (see <a href="https://stackoverflow.com/questions/30332490/finding-unique-tuples-in-r-but-ignoring-order">this</a> question for a base R solution). Let’s first create some example data:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">values</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"B"</span><span class="p">,</span><span class="w"> </span><span class="s2">"C"</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">ID1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">ID2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">ID3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">df</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## ID1 ID2 ID3
## 1 A A A
## 2 B A A
## 3 C A A
## 4 A B A
## 5 B B A
## 6 C B A
## 7 A C A
## 8 B C A
## 9 C C A
## 10 A A B
## 11 B A B
## 12 C A B
## 13 A B B
## 14 B B B
## 15 C B B
## 16 A C B
## 17 B C B
## 18 C C B
## 19 A A C
## 20 B A C
## 21 C A C
## 22 A B C
## 23 B B C
## 24 C B C
## 25 A C C
## 26 B C C
## 27 C C C</code></pre></figure>
<p>We would like to obtain unique ID combinations without taking order into account, that is, “AAB” and “ABA” are both the same:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">distinct</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">ID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pmap_chr</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">starts_with</span><span class="p">(</span><span class="s2">"ID"</span><span class="p">)),</span><span class="w">
</span><span class="o">~</span><span class="n">paste0</span><span class="p">(</span><span class="n">sort</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">...</span><span class="p">)),</span><span class="w"> </span><span class="n">collapse</span><span class="o">=</span><span class="s2">"_"</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## ID
## 1 A_A_A
## 2 A_A_B
## 3 A_A_C
## 4 A_B_B
## 5 A_B_C
## 6 A_C_C
## 7 B_B_B
## 8 B_B_C
## 9 B_C_C
## 10 C_C_C</code></pre></figure>
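<p>A minimal sketch of how the two functions hand a row to the callback, on a made-up two-row data frame (<code class="language-plaintext highlighter-rouge">toy</code> and its columns are illustrative): <code class="language-plaintext highlighter-rouge">pmap_chr()</code> passes one argument per column, whereas <code class="language-plaintext highlighter-rouge">apply()</code> passes the whole row as a single vector.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(purrr)

toy <- data.frame(ID1 = c("B", "C"), ID2 = c("A", "A"), stringsAsFactors = FALSE)

# pmap: each column arrives as its own argument, so they are gathered with c(...)
via_pmap <- pmap_chr(toy, function(...) paste0(sort(c(...)), collapse = "_"))

# apply: the row arrives as one character vector, so no c(...) is needed
via_apply <- apply(toy, 1, function(row) paste0(sort(row), collapse = "_"))

via_pmap
via_apply</code></pre></figure>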
<p>Note the <code class="language-plaintext highlighter-rouge">c(...)</code>, since the <code class="language-plaintext highlighter-rouge">.f</code> argument in <code class="language-plaintext highlighter-rouge">pmap()</code> is a function with as many arguments as columns in the data frame (in contrast to base <code class="language-plaintext highlighter-rouge">apply()</code>). Thus we need to collect them all in a vector, which is then sorted and finally converted into a single value with <code class="language-plaintext highlighter-rouge">paste(..., collapse="_")</code>.</p>Alberto Torres Barránalbertotb@gmail.comThis small example aims to provide some use cases for the tidyr package. Let’s generate some example data first: library(lubridate) library(tibble) library(dplyr) library(tidyr) library(ggplot2) library(forcats) library(purrr) set.seed(1234) sales <- tibble(date = ymd(rep(c(20180101, 20180102, 20180103), 3)), product = rep(c("A", "B", "C"), each = 3), sales = sample(1:20, size = 9, replace = T)) sales ## # A tibble: 9 x 3 ## date product sales ## <date> <chr> <int> ## 1 2018-01-01 A 3 ## 2 2018-01-02 A 13 ## 3 2018-01-03 A 13 ## 4 2018-01-01 B 13 ## 5 2018-01-02 B 18 ## 6 2018-01-03 B 13 ## 7 2018-01-01 C 1 ## 8 2018-01-02 C 5 ## 9 2018-01-03 C 14 We want to compute the correlation of the sales from products A, B and C. The base R function cor() takes a matrix or data.frame and computes the correlation between all the column pairs. Thus, first we need to convert the data.frame sales, which is in long form, to wide form with one column per product. 
Equivalence between distribution functions in R and Python
2019-01-22T00:00:00+00:00
https://albertotb.github.io/Equivalence-between-distribution-functions-in-R-and-Python
<p>The functions that work with probability distributions are named differently in R and SciPy, which is often confusing. The following table lists the equivalence between the main functions:</p>
<table>
<thead>
<tr>
<th>R</th>
<th>SciPy</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">dnorm()</code></td>
<td><code class="language-plaintext highlighter-rouge">pdf()</code></td>
<td>Probability density function (PDF)</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">pnorm()</code></td>
<td><code class="language-plaintext highlighter-rouge">cdf()</code></td>
<td>Cumulative distribution function (CDF)</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">qnorm()</code></td>
<td><code class="language-plaintext highlighter-rouge">ppf()</code></td>
<td>Percent point function (inverse CDF)</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">pnorm(lower.tail = FALSE)</code></td>
<td><code class="language-plaintext highlighter-rouge">sf()</code></td>
<td>Complementary CDF (CCDF) or survival function</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">qnorm(lower.tail = FALSE)</code></td>
<td><code class="language-plaintext highlighter-rouge">isf()</code></td>
<td>CCDF inverse or inverse survival function</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">rnorm()</code></td>
<td><code class="language-plaintext highlighter-rouge">rvs()</code></td>
<td>Random samples</td>
</tr>
</tbody>
</table>
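<p>As a rough illustration of the table, the following sketch calls each SciPy method on the standard normal distribution, with the corresponding R call in a comment. It assumes that <code class="language-plaintext highlighter-rouge">scipy</code> is installed; the evaluation points (1.96 and 0.975), the sample size and the seed are arbitrary choices made for this example.</p>

```python
from scipy.stats import gamma, norm

# Arbitrary evaluation points chosen for this sketch
x, p = 1.96, 0.975

density = norm.pdf(x)             # R: dnorm(1.96)
lower_tail = norm.cdf(x)          # R: pnorm(1.96)
quantile = norm.ppf(p)            # R: qnorm(0.975)
upper_tail = norm.sf(x)           # R: pnorm(1.96, lower.tail = FALSE)
upper_quantile = norm.isf(1 - p)  # R: qnorm(0.025, lower.tail = FALSE)
samples = norm.rvs(size=5, random_state=0)  # R: rnorm(5)

# Other distributions follow the same pattern, e.g. the gamma distribution
gamma_density = gamma.pdf(1.0, a=2.0)  # R: dgamma(1, shape = 2)

print(density, lower_tail, quantile, upper_tail, upper_quantile)
print(samples)
print(gamma_density)
```

<p>The lower and upper tails add up to one, and <code class="language-plaintext highlighter-rouge">ppf()</code> and <code class="language-plaintext highlighter-rouge">isf()</code> return the same point when given complementary probabilities.</p>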
<p>Note: in R the names are illustrated using the normal distribution. Functions for other distributions are named by keeping the first letter and changing the name of the distribution, for example, for the gamma distribution: <code class="language-plaintext highlighter-rouge">dgamma()</code>, <code class="language-plaintext highlighter-rouge">pgamma()</code>, <code class="language-plaintext highlighter-rouge">qgamma()</code> and <code class="language-plaintext highlighter-rouge">rgamma()</code>. In SciPy, the methods are called on a distribution object from <code class="language-plaintext highlighter-rouge">scipy.stats</code>, for example <code class="language-plaintext highlighter-rouge">norm.pdf()</code> or <code class="language-plaintext highlighter-rouge">gamma.pdf()</code>.</p>
Alberto Torres Barrán albertotb@gmail.com
Create new example environment in LaTeX
2017-11-03T00:00:00+00:00
https://albertotb.github.io/Create-new-example-environment-in-Latex
<p>The following code creates a new Example environment that ends with a triangle instead of a square. It relies on the <code class="language-plaintext highlighter-rouge">amsthm</code> package, which provides <code class="language-plaintext highlighter-rouge">\theoremstyle</code>, <code class="language-plaintext highlighter-rouge">\pushQED</code> and <code class="language-plaintext highlighter-rouge">\popQED</code>.</p>
<div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\theoremstyle</span><span class="p">{</span>definition<span class="p">}</span>
<span class="k">\newtheorem</span><span class="p">{</span>examplex<span class="p">}{</span>Example<span class="p">}</span>
<span class="k">\newenvironment</span><span class="p">{</span>example<span class="p">}</span>
<span class="p">{</span><span class="k">\pushQED</span><span class="p">{</span><span class="k">\qed</span><span class="p">}</span><span class="k">\renewcommand</span><span class="p">{</span><span class="k">\qedsymbol</span><span class="p">}{$</span><span class="nv">\triangle</span><span class="p">$}</span><span class="k">\examplex</span><span class="p">}</span>
<span class="p">{</span><span class="k">\popQED\endexamplex</span><span class="p">}</span>
</code></pre></div></div>
Alberto Torres Barrán albertotb@gmail.com
Split long Jupyter notebook
2017-10-20T00:00:00+00:00
https://albertotb.github.io/Split-long-Jupyter-notebook
<p>The following script, adapted from <a href="https://blog.ouseful.info/2015/12/03/some-jupyter-notebook-nbconvert-housekeeping-hints/">here</a>, splits a notebook into several files, starting a new one at every Markdown cell that contains the text <code class="language-plaintext highlighter-rouge">SPLIT NOTEBOOK</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python
</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">nbformat</span> <span class="k">as</span> <span class="n">nb</span>
<span class="kn">import</span> <span class="nn">nbformat.v4.nbbase</span> <span class="k">as</span> <span class="n">nb4</span>

<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="s">"usage: {} NOTEBOOK"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="n">sys</span><span class="p">.</span><span class="nb">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

<span class="n">mynb</span> <span class="o">=</span> <span class="n">nb</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">nb</span><span class="p">.</span><span class="n">NO_CONVERT</span><span class="p">)</span>
<span class="n">basename</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># c numbers the output files; test collects the cells of the current part</span>
<span class="n">c</span><span class="o">=</span><span class="mi">1</span>
<span class="n">test</span><span class="o">=</span><span class="n">nb4</span><span class="p">.</span><span class="n">new_notebook</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">mynb</span><span class="p">[</span><span class="s">'cells'</span><span class="p">]:</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="s">'cell_type'</span><span class="p">]</span><span class="o">==</span><span class="s">'markdown'</span><span class="p">):</span>
        <span class="k">if</span> <span class="p">(</span><span class="s">'SPLIT NOTEBOOK'</span> <span class="ow">in</span> <span class="n">i</span><span class="p">[</span><span class="s">'source'</span><span class="p">]):</span>
            <span class="c1"># Marker cell: write the current part and start a new one</span>
            <span class="n">nb</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">test</span><span class="p">,</span><span class="s">'{}_{}.ipynb'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">basename</span><span class="p">,</span> <span class="n">c</span><span class="p">))</span>
            <span class="n">c</span><span class="o">=</span><span class="n">c</span><span class="o">+</span><span class="mi">1</span>
            <span class="n">test</span><span class="o">=</span><span class="n">nb4</span><span class="p">.</span><span class="n">new_notebook</span><span class="p">()</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">test</span><span class="p">.</span><span class="n">cells</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nb4</span><span class="p">.</span><span class="n">new_markdown_cell</span><span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="s">'source'</span><span class="p">]))</span>
    <span class="k">elif</span> <span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="s">'cell_type'</span><span class="p">]</span><span class="o">==</span><span class="s">'code'</span><span class="p">):</span>
        <span class="c1"># Copy the code cell together with its stored outputs</span>
        <span class="n">cc</span><span class="o">=</span><span class="n">nb4</span><span class="p">.</span><span class="n">new_code_cell</span><span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="s">'source'</span><span class="p">])</span>
        <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">i</span><span class="p">[</span><span class="s">'outputs'</span><span class="p">]:</span>
            <span class="n">cc</span><span class="p">[</span><span class="s">'outputs'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">o</span><span class="p">)</span>
        <span class="n">test</span><span class="p">.</span><span class="n">cells</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">cc</span><span class="p">)</span>

<span class="c1"># Write the last part</span>
<span class="n">nb</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">test</span><span class="p">,</span><span class="s">'{}_{}.ipynb'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">basename</span><span class="p">,</span> <span class="n">c</span><span class="p">))</span>
</code></pre></div></div>
Alberto Torres Barrán albertotb@gmail.com
Test if port 22 is open
2013-10-01T17:18:00+00:00
https://albertotb.github.io/bash-test-if-port-22-is-open
With this script you can check if a remote machine is listening on port 22, which is the default SSH port. It uses <code>nc</code> (netcat) and gives up after a 5-second timeout.
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">port</span><span class="o">=</span>22
<span class="nb">timeout</span><span class="o">=</span>5
<span class="k">if</span> <span class="o">((</span> <span class="s2">"$#"</span> <span class="o">!=</span> 1 <span class="o">))</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"usage: </span><span class="si">$(</span><span class="nb">basename</span> <span class="nv">$0</span><span class="si">)</span><span class="s2"> HOST"</span>
    <span class="nb">exit </span>1
<span class="k">fi
</span><span class="nv">host</span><span class="o">=</span><span class="nv">$1</span>
<span class="k">if </span>nc <span class="nt">-w</span> <span class="nv">$timeout</span> <span class="nt">-z</span> <span class="nv">$host</span> <span class="nv">$port</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"Yes"</span>
<span class="k">else
</span><span class="nb">echo</span> <span class="s2">"No"</span>
<span class="k">fi</span></code></pre></figure>
Alberto Torres Barrán albertotb@gmail.com
R read dataset
2013-09-19T16:03:00+00:00
https://albertotb.github.io/r-read-dataset
I usually store my datasets in an ASCII/CSV file where the first column is the output or response and the subsequent columns are the input variables, with one row per pattern/observation. When loading those datasets in R, I often find myself separating the input from the output into two variables to feed them into some algorithm. Therefore I created the following function, which can be added to the <code>.Rprofile</code>:
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">read.dataset</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="n">file</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">data</span><span class="p">[,</span><span class="o">-</span><span class="n">response</span><span class="p">])</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="p">[,</span><span class="n">response</span><span class="p">]</span><span class="w">
</span><span class="n">dataset</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
With the previous function I can read the dataset in one line and access the input variables and the output separately:
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="w"> </span><span class="n">train</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.dataset</span><span class="p">(</span><span class="s2">"somedata.train"</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">train</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">train</span><span class="o">$</span><span class="n">x</span><span class="p">)</span></code></pre></figure>
The function also works when the output is not in the first column: set the optional parameter <code>response</code> to the index of the response column. Any additional arguments are passed along to the R function <code>read.table</code>, for instance <code>sep = ","</code> if the columns are delimited by commas instead of spaces.
Alberto Torres Barrán albertotb@gmail.com
Vim: replace whole words
2013-09-04T11:13:00+00:00
https://albertotb.github.io/vim-reemplazar-palabras-completas
Vim regular expression to replace whole words. This example replaces every <code>foo</code> with <code>bar</code>, but leaves <code>foo</code> untouched when it is part of a longer word (<code>fooxyz</code> is not changed to <code>barxyz</code>): <code>%s/\&lt;foo\&gt;/bar/g</code>
<a href="http://stackoverflow.com/questions/1778501/find-and-replace-whole-words-in-vim">Source</a>
Alberto Torres Barrán albertotb@gmail.com
awk: create a confusion matrix
2012-09-18T16:46:00+00:00
https://albertotb.github.io/awk-crear-matriz-de-confusion
In <i>machine learning</i>, a very common way to present the results of a model applied to a classification problem is a <a href="http://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>. The following awk script builds a confusion matrix from a file whose first column is the model output (binary, 0 or 1) and whose second column is the actual output variable (also binary, 0 or 1):
<figure class="highlight"><pre><code class="language-awk" data-lang="awk"><span class="kr">BEGIN</span><span class="p">{</span><span class="nx">tp</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="nx">tn</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="nx">fp</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="nx">fn</span><span class="o">=</span><span class="mi">0</span><span class="p">}</span>
<span class="p">{</span>
<span class="k">if</span><span class="p">(</span><span class="nv">$1</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span>
<span class="nv">$2</span><span class="o">==</span><span class="mi">0</span><span class="p">?</span> <span class="nx">tn</span><span class="o">++</span> <span class="p">:</span> <span class="nx">fn</span><span class="o">++</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="nv">$1</span><span class="o">==</span><span class="mi">1</span><span class="p">)</span>
<span class="nv">$2</span><span class="o">==</span><span class="mi">1</span><span class="p">?</span> <span class="nx">tp</span><span class="o">++</span> <span class="p">:</span> <span class="nx">fp</span><span class="o">++</span>
<span class="p">}</span>
<span class="kr">END</span><span class="p">{</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"\nConfusion matrix:\n"</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"+-----+------------+\n"</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"| A\\P | 0 1 |\n"</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"+-----+------------+\n"</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"| 0 | %4d %4d |\n"</span><span class="p">,</span><span class="nx">tn</span><span class="p">,</span><span class="nx">fp</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"| 1 | %4d %4d |\n"</span><span class="p">,</span><span class="nx">fn</span><span class="p">,</span><span class="nx">tp</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"+-----+------------+\n"</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"\nA=Actual, P=Predicted\n\n"</span><span class="p">)</span>
<span class="nx">tpr</span> <span class="o">=</span> <span class="nx">tp</span><span class="o">/</span><span class="p">(</span><span class="nx">tp</span><span class="o">+</span><span class="nx">fn</span><span class="p">)</span>
<span class="nx">tnr</span> <span class="o">=</span> <span class="nx">tn</span><span class="o">/</span><span class="p">(</span><span class="nx">fp</span><span class="o">+</span><span class="nx">tn</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"Sensitivity = %g%%\n"</span><span class="p">,</span> <span class="nx">tpr</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"Specificity = %g%%\n"</span><span class="p">,</span> <span class="nx">tnr</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="k">printf</span><span class="p">(</span><span class="s2">"Accuracy (balanced) = %g%%\n"</span><span class="p">,</span> <span class="p">(</span><span class="nx">tpr</span><span class="o">+</span><span class="nx">tnr</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
A fairly common use of this script is when we have a file with test data (<code>data.test</code>) in which each column is a variable, separated by commas. One of those columns is the output variable (class 0 or class 1). For this example, we assume that the output variable is in the first column. In addition, a separate file (<code>modelo.output</code>) holds a single column with the output of our classification model applied to that same test file. In this case, the previous script is used as follows:
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">cut</span> <span class="nt">-d</span><span class="s2">","</span> <span class="nt">-f1</span> data.test | <span class="nb">paste</span> <span class="nt">-d</span><span class="s2">" "</span> modelo.output - | <span class="nb">awk</span> <span class="nt">-f</span> conf_matrix.awk </code></pre></figure>
The output of the previous command (assuming the awk script is stored in the file <code>conf_matrix.awk</code>) would look something like this:
<figure class="highlight"><pre><code class="language-terminal" data-lang="terminal"><span class="go">Accuracy = 71.4286% (220/308) (classification)
Confusion matrix:
+-----+------------+
| A\P | 0 1 |
+-----+------------+
| 0 | 194 74 |
| 1 | 14 26 |
+-----+------------+
A=Actual, P=Predicted
Sensitivity = 65%
Specificity = 72.3881%
Accuracy (balanced) = 68.694%</span></code></pre></figure>
Alberto Torres Barrán albertotb@gmail.com
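As a cross-check of the sample output, the metrics can be recomputed from the four counts in the matrix. This is only a sketch: the function name <code>confusion_metrics</code> is made up for this example, and the counts are copied from the output shown above.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return sensitivity, specificity and balanced accuracy as percentages."""
    tpr = tp / (tp + fn)  # sensitivity: share of actual 1s predicted as 1
    tnr = tn / (tn + fp)  # specificity: share of actual 0s predicted as 0
    return tpr * 100, tnr * 100, (tpr + tnr) / 2 * 100


# Counts copied from the sample confusion matrix (rows = actual, columns = predicted)
sensitivity, specificity, balanced = confusion_metrics(tn=194, fp=74, fn=14, tp=26)

print("Sensitivity = {:g}%".format(sensitivity))
print("Specificity = {:g}%".format(specificity))
print("Accuracy (balanced) = {:g}%".format(balanced))
```

Balanced accuracy averages the two rates, so it is not dominated by the much larger class 0 the way plain accuracy is.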