Appsilon Data Science BlogHow to create and use technology to deliver business results.
http://appsilondatascience.com/blog/
Fri, 08 Dec 2017 08:24:59 +0000Fri, 08 Dec 2017 08:24:59 +0000Jekyll v3.4.5Writing Excel formatted csv using readr::write_excel_csv2<h2 id="why-this-post">Why this post?</h2>
<p>Currently, my team and I are building a Shiny app that serves as an interface for a forecasting model. The app allows business users to interact with predictions. However, we keep getting feature requests, such as, “Can we please have this exported to Excel.”</p>
<p>Our client chose to have results exported to a csv file and wants to open them in Excel. The app is already running on a Linux server, and the csv files that can be downloaded via the app are <strong>utf-8</strong> encoded.</p>
<p>If you are a Linux user you may not be aware that Windows Excel is not able to recognize utf-8 encoding automatically. It turns out that a <a href="https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically/6002338#6002338">few people</a> faced this problem in the past.</p>
<p>Obviously, we cannot have a solution where our users are changing options in Excel or opening the file in any other way than double clicking.</p>
<p>We find having a Shiny App that allows for Excel export to be a good compromise between R/Shiny and Excel. It gives the user the power of interactivity and online access, while still preserving the possibility to work with the results in the environment they are most used to. This is a great way to gradually accustom users to working in Shiny.</p>
<h2 id="current-available-solution-in-r">Current available solution in R</h2>
<p>What we want is the following: write a csv file with <strong>utf-8</strong> encoding and a BOM<label for="‘BOM’" class="margin-toggle" style="font-size: 0.8em; text-decoration: underline"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input type="checkbox" id="‘BOM’" class="margin-toggle" /><span class="marginnote">The byte order mark (BOM) is a Unicode character placed at the start of a file that signals the encoding of the document. </span>. This has been <a href="https://github.com/tidyverse/readr/issues/375">addressed in R by RStudio</a> in the <code class="highlighter-rouge">readr</code> package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">readr</span><span class="p">)</span><span class="w">
</span><span class="n">write_excel_csv</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="s2">"assets/data/readr/my_file.csv"</span><span class="p">)</span></code></pre></figure>
<p>This is great and solves the problem with opening the file in Excel, but… supports only one type of locale.</p>
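<p>To see what the BOM looks like on disk, one can inspect the first bytes of the written file. The snippet below is only a minimal sketch: the temporary file and the variable names <code class="highlighter-rouge">tmp_bom</code>, <code class="highlighter-rouge">first_bytes</code> and <code class="highlighter-rouge">has_bom</code> are our own illustration, and it assumes that the UTF-8 BOM is the three-byte sequence <code class="highlighter-rouge">EF BB BF</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(readr)

# Write a small data frame to a temporary file (illustrative path)
tmp_bom <- tempfile(fileext = ".csv")
write_excel_csv(head(mtcars), tmp_bom)

# Read the first three raw bytes of the file
first_bytes <- readBin(tmp_bom, what = "raw", n = 3)

# Compare them with the UTF-8 byte order mark EF BB BF
has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
print(first_bytes)
print(has_bom)

unlink(tmp_bom)</code></pre></figure>
<p>If <code class="highlighter-rouge">has_bom</code> is <code class="highlighter-rouge">TRUE</code>, Excel has the hint it needs to treat the file as utf-8 when the user double clicks it.</p>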
<h2 id="show-me-your-locale">Show me your locale</h2>
<p>Depending on where you live you might have a different locale. <a href="https://en.wikipedia.org/wiki/Locale_(computer_software)">Locale</a> is a set of parameters that defines the user’s language, region and any special variant preferences that the user wants to see in their user interface.</p>
<p>This means that number formatting can differ between regions. For example, in the USA <code class="highlighter-rouge">.</code> is used as a decimal separator, whereas almost all of Europe uses <code class="highlighter-rouge">,</code>. This <a href="https://en.wikipedia.org/wiki/Decimal_mark">article</a> shows how countries around the world define their number formats.</p>
<p>This shows that there is a real need to extend the <code class="highlighter-rouge">readr</code> functionality and allow users to save csv files for Excel with a European locale easily and quickly. This is not currently possible since <code class="highlighter-rouge">write_excel_csv</code> only allows one to write in the US locale.</p>
<h2 id="new-addition-to-readr">New addition to readr</h2>
<p>We proposed to add <code class="highlighter-rouge">write_excel_csv2()</code> to the <code class="highlighter-rouge">readr</code> package, which would allow the user to write a csv with <code class="highlighter-rouge">,</code> as a decimal separator and <code class="highlighter-rouge">;</code> as a column separator. To be consistent with the naming convention in R for functions reading in (e.g. <code class="highlighter-rouge">read.csv()</code> and <code class="highlighter-rouge">read.csv2()</code>) or writing (e.g. <code class="highlighter-rouge">write.csv()</code> and <code class="highlighter-rouge">write.csv2()</code>) csv files with a different delimiter, we decided to simply add <code class="highlighter-rouge">2</code> to <code class="highlighter-rouge">write_excel_csv()</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="nf">on.exit</span><span class="p">(</span><span class="n">unlink</span><span class="p">(</span><span class="n">tmp</span><span class="p">))</span><span class="w">
</span><span class="n">readr</span><span class="o">::</span><span class="n">write_excel_csv2</span><span class="p">(</span><span class="n">mtcars</span><span class="p">,</span><span class="w"> </span><span class="n">tmp</span><span class="p">)</span></code></pre></figure>
<p>To prove that it works, let’s read the first two lines and inspect the output.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">readr</span><span class="o">::</span><span class="n">read_lines</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">n_max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
## [2] "21,0;6;160,0;110;3,90;2,620;16,46;0;1;4;4"</code></pre></figure>
<p><code class="highlighter-rouge">write_excel_csv2()</code> is already available for download from the <code class="highlighter-rouge">readr</code> repository and should be available on CRAN with the next release.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"tidyverse/readr"</span><span class="p">)</span></code></pre></figure>
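<p>Coming back to the Shiny use case that motivated this post, a download button could be wired up roughly as shown below. This is a minimal sketch rather than our production code: the output id <code class="highlighter-rouge">download_forecast</code>, the file name pattern and the <code class="highlighter-rouge">forecast_data</code> data frame are hypothetical placeholders.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(shiny)
library(readr)

# Hypothetical stand-in for the model output shown in the app
forecast_data <- head(mtcars)

ui <- fluidPage(
  downloadButton("download_forecast", "Export to Excel (csv)")
)

server <- function(input, output, session) {
  output$download_forecast <- downloadHandler(
    # Name suggested to the browser for the downloaded file
    filename = function() {
      paste0("forecast-", Sys.Date(), ".csv")
    },
    # Shiny passes a temporary path; we write the csv there
    content = function(file) {
      # Semicolon separated, decimal comma, utf-8 with BOM
      write_excel_csv2(forecast_data, file)
    }
  )
}

# Build the app object; start it with shiny::runApp(app)
app <- shinyApp(ui, server)</code></pre></figure>
<p>For users with a US locale, the same handler could call <code class="highlighter-rouge">write_excel_csv()</code> instead.</p>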
<p>We hope you and your business team will find this addition useful.</p>
Fri, 08 Dec 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/12/08/readr.html
/blog/rstats/2017/12/08/readr.htmlExcel,readr,csv,write_excel_csv2,write.csv2rstatsHow to build a Successful Advanced Analytics Department<p>This article presents our opinions and suggestions on how an Advanced Analytics department should operate. The post is not intended to be a comprehensive list of steps, but rather, a list of tips and warnings. We hope this will be useful for those who want to implement analytics work in their company, as well as for existing departments.</p>
<p>The post is divided into 3 parts. The first provides a list of the most important aspects to be aware of when leading AA work in your organization. If you are already leading such a department, then you are already aware of these, but the list can still prove useful when presenting your case higher in the hierarchy or to another department. The second part lists the most important elements that need to be addressed and cared for in such a department. Lastly, we caution you about the most common issues we have seen in failed initiatives.</p>
<h2 id="part-i---why-should-you-care">Part I - why should you care?</h2>
<h3 id="increase-margin">Increase margin</h3>
<p>The most significant benefits are for business. One of the most important challenges is meeting management and C-level expectations. A proper implementation does, however, increase the chances of success and can even guarantee returns in the initial phases. AI and advanced analytics solutions are very tempting, but not every company is ready for them and they can be expensive. A company should determine its current level of progress and, based on this, decide what the logical next step should be. Trying to do too much at once carries increased risks. However, advanced analytics is a good springboard for getting into AI.</p>
<h3 id="new-business-models">New business models</h3>
<p>Machine learning, and deep learning in particular, have opened up completely new possibilities in the realm of predictions. In addition, companies are collecting more and more data. It’s estimated that the planet generates over 2.5 million terabytes of data every day (2.5 exabytes) <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. The volume of data is growing exponentially (we double our data every 2 years: Moore’s Data Storage Law). Thanks to that, it is possible to train analytical models that would have been impossible even a few years ago. Businesses have taken notice of the power of their data sets, leading them to develop completely new products and initiatives.</p>
<p><label for="‘baidu’" class="margin-toggle" style="font-size: 0.8em; text-decoration: underline"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input type="checkbox" id="‘baidu’" class="margin-toggle" /><span class="marginnote">Baidu used a vast amount of data to train their voice recognition model. <a href="https://blog.ycombinator.com/baidus-ai-lab-director-on-advancing-speech-recognition-and-simulation/">Listen</a> to <a href="https://twitter.com/adampaulcoates">Adam Coates</a>, Director of <a href="http://research.baidu.com/">Baidu’s Silicon Valley AI Lab</a>, discussing how they acquired data and improved their model. </span></p>
<h3 id="reduce-manual-work-time-and-errors">Reduce manual work time and errors</h3>
<p>The truth is that most companies still depend heavily on manual work. We’ve seen tons of spreadsheets and understand that they are not going anywhere anytime soon. With that said, it’s hard to justify staying solely in spreadsheets in 2018. We’ve seen real-time dashboards replacing monthly spreadsheet summaries. We are sure that teams currently pasting data into cells every week and every month would be happy to automate these tasks and start analyzing the data instead. <a href="https://www.economist.com/news/business/21720675-firm-using-algorithm-designed-cern-laboratory-how-germanys-otto-uses">Otto, a German e-commerce company</a>, shows to what extent it is possible to minimize manual work and improve business metrics. Our intuition tells us that this would replace employees with algorithms, but no. They even increased their workforce.</p>
<h3 id="your-competition-is-already-working-on-it">Your competition is already working on it</h3>
<p>The last and potentially strongest argument why this is worth considering: your competition is already working on this. We’ve analyzed and worked with tens of different industries and identified the leaders within Advanced Analytics and AI. These companies are pulling away and are experiencing compounding returns, much like the growth of the data we generate. It’s hard to tell whether they will maintain their lead, but we are certain that companies who refuse to take steps in this direction will not fare well.</p>
<p>Using data is proven to work. Our clients’ examples showcase the possibilities. We’ve built and implemented a dynamic pricing model that deals with over 2 million quarterly pricing decisions. We’ve increased fraud detection from 50% to over 90% and predicted e-commerce sales a year in advance more accurately. Our portfolio includes AI and AA projects for a large range of industries, often including industry leaders.</p>
<p>If you discover more fraud than your competition, you get an advantage. Moreover, you build from there. You get new ideas every time you work with data. The power and value of using data spread within organizations as managers start noticing results.</p>
<h2 id="how-to-get-an-advantage">How to get an advantage?</h2>
<h3 id="self-audit">Self-audit</h3>
<p>The very first thing is to understand where you stand today. It might be that you already gather a lot of data, but your organization is mostly driven by spreadsheets. This is how a vast number of organizations are managed. The good news, in this case, is that your next steps are easier to implement and the costs are significantly lower.</p>
<p>It may be that your company is already doing predictive analytics and you are thinking about prescriptive solutions. As such, you already have a good understanding of the benefits and the amount of work needed to take it to the next level.</p>
<h4 id="gartner-report">Gartner report</h4>
<p>A Gartner <sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> breakdown of the four phases of data science projects may help you determine your current position. Treat this as an approximation as it may be the case that different departments are at different levels.</p>
<p>There are four fundamental ways of creating and using insights:</p>
<ul>
<li>descriptive - one that focuses on gathering facts about the past. Most analytical dashboards work this way. They present read-only data with high-level metrics.</li>
<li>diagnostic - this one makes use of the data to understand the reasons behind the observed values.</li>
<li>predictive - using forecasts is the first step to directly influence the decision process. Creating forecasts always requires an analytical model underneath. The quality of our decisions depends on the accuracy of the model.</li>
<li>prescriptive - leads directly to a decision, either as a suggestion or through automation. In the first case, a decision support system is created that closely cooperates with a human.</li>
</ul>
<p>Short summary:</p>
<style>
.post-content th, td {
border-bottom: 1px solid #333333 !important;
}
.post-content table {
border-bottom: 1px solid #333333;
}
</style>
<p><br /></p>
<table>
<thead>
<tr>
<th>Type</th>
<th style="text-align: center">Insight</th>
</tr>
</thead>
<tbody>
<tr>
<td>Descriptive</td>
<td style="text-align: center">What were the sales last month?</td>
</tr>
<tr>
<td>Diagnostic</td>
<td style="text-align: center">How did the weather influence our sales?</td>
</tr>
<tr>
<td>Predictive</td>
<td style="text-align: center">Next week is going to be super hot. Should we get more supplies?</td>
</tr>
<tr>
<td>Prescriptive (support system)</td>
<td style="text-align: center">Get 120 more vanilla ice creams to that store.</td>
</tr>
<tr>
<td>Prescriptive (automated)</td>
<td style="text-align: center">Requests ice creams for you.</td>
</tr>
</tbody>
</table>
<p><br />
<br /></p>
<p>Take note that the earlier phases rely more heavily on manual work and regular processing of the same tasks. It is the more advanced steps that truly automate work. At these stages, people are crucial in defining problems and making the final judgment.</p>
<p><br /></p>
<p><img src="/blog/assets/article_images/how-to-get-into-advanced-analytics/gartner.jpg" alt="The Four Analytic Capabilities, source: Gartner" /></p>
<p><br /></p>
<h3 id="have-a-clear-goal">Have a clear goal.</h3>
<p>One of the biggest issues in AA departments is that their existence is a goal in and of itself. Such a department is most useful when its operations are close to the core of the company’s existing problems. It’s important to identify a few key areas where you expect analysts to be able to work. Such a goal should also be attainable; otherwise, you will have a difficult time achieving the expected results. For example, it will be difficult to achieve Prescriptive results if you are still in a Descriptive phase. Such a project will also be negatively received by the rest of the organization because of workplace politics, a lack of understanding, or other reasons.</p>
<h3 id="understand-the-difference-between-development-and-research">Understand the difference between development and research.</h3>
<p>Communicate it clearly to others. One of the most difficult situations an Advanced Analytics department may find itself in is agreeing to work on a difficult research-intensive project in the same way as on a traditional software development project. This can be especially difficult because a large portion of the tasks do indeed have a software development-like nature.</p>
<p><label for="‘mn-trick’" class="margin-toggle" style="font-size: 0.8em; text-decoration: underline"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input type="checkbox" id="‘mn-trick’" class="margin-toggle" /><span class="marginnote">If you want to learn more about reproducible research make sure to check <a href="/blog/rstats/2017/03/28/reproducible-research-when-your-results-cant-be-reproduced.html">this post</a>. </span></p>
<p>Software development-like work:</p>
<ul>
<li>Interface elements, such as building a dashboard or API, UI/UX</li>
<li>Infrastructural work, for example, the environment for reproducible research</li>
</ul>
<p>Work of a research nature that should not be treated as engineering:</p>
<ul>
<li>Building analytic models</li>
<li>Data processing and validation</li>
</ul>
<p>The tasks are different because estimating the full project scope up-front is impossible. Research is based on iterative hypothesis validation and incremental steps forward. What an Advanced Analytics department could commit to is to validate certain hypotheses in a given time frame, but not to build a model with a given precision or accuracy in a fixed time period.</p>
<p>It’s worth reiterating that software development of any kind is not a completely predictable process. There is a wide array of methodologies that help lead engineering projects more effectively and efficiently. At Appsilon Data Science we are fans of the Scrum methodology. It has proven very effective in our past and ongoing projects and can be adapted to research, resulting in consistent project management. Download our <a href="/scrum-in-data-science">Scrum for Data Science Free E-book</a> if you want to learn more.</p>
<h3 id="get-the-right-skills-onboard">Get the right skills onboard</h3>
<p>If you have adequately met the three points listed, then the fourth will be much easier. You will not need advanced Deep Learning or complex statistics if your project is Descriptive or Diagnostic. What you will need is someone adept at data processing, modeling, and reporting. Projects that have a significant portion of software development can have an experienced programmer with data processing skills at their core. This may be obvious, but it’s important to hire talented individuals, as they will in many ways build your team, determine your technology stack and shape your analytics culture. Talented individuals tend to attract other talented individuals who want to learn from them.</p>
<h2 id="how-not-to-fail">How not to fail</h2>
<p>Even if you get everything right, there is still the risk of failure. We don’t have a silver bullet, but we know that any of the issues below drastically increases your odds of failure.</p>
<h3 id="wrong-kpis">Wrong KPIs</h3>
<blockquote>
<p>“There is nothing so useless as doing efficiently that which should not be done at all.”
Peter Drucker</p>
</blockquote>
<p>The way you measure your Advanced Analytics department will have a significant impact on its method of work.</p>
<p>A team working on one model such as a recommendation engine can use precision and recall as appropriate metrics for such a model. An alternative metric could be validating a given hypothesis. Both of these will not only affect the nature of the work and team morale but also the way in which management will view the team.</p>
<h3 id="make-sure-you-can-use-that-data">Make sure you can use that data</h3>
<p>It is unfortunate to see AA work put on hold as a result of not following data processing regulations and legislation. These situations are not uncommon; we expect their number to grow when GDPR comes into effect.</p>
<h3 id="wishful-thinking-goals">Wishful thinking goals</h3>
<p>Going for revolution instead of evolution is not a good idea. Research quality will suffer if the team experiences communication difficulties or is given unrealistic expectations from stakeholders. An example includes setting required model performance rates up front. Such a rate can be known if there are adequate business arguments but such rates should be considered a goal to aspire to.</p>
<p>Another example of such behavior is when the team tries to create something that is completely beyond their capabilities or experience. For example, a company at a Descriptive level would like to jump to a Prescriptive level. Such wide jumps inherently include a much larger risk but also tend to be much more expensive.</p>
<h3 id="thinking-that-excel-is-advanced-analytics">Thinking that Excel is Advanced Analytics</h3>
<p>The last issue is the most common. Thankfully, the community’s awareness of advanced analytics is growing quickly, but there are still cases when individuals are overly anxious about moving away from spreadsheets. The number of solutions is large, but it is worth investing some time up-front. The open source ecosystem is very large; one does not have to look very long to find a much better alternative.</p>
<p>The list of problems associated with spreadsheets is as long as the list of benefits. We understand that spreadsheets can often be useful, but we find it difficult to imagine that, for example, a responsible approach to reproducible research would include spreadsheet calculations.</p>
<h2 id="wrapping-things-up">Wrapping things up</h2>
<p>I trust that the advice you’ve read will be useful for you. We are always open to discuss how this can help your organization. If you know of situations where other solutions have proven useful, then let us know in the comments.</p>
<p>Sign up for our newsletter to stay in touch with us and follow us on social media. Cheers!</p>
<p><br /></p>
<p>Sources:</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/ <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day/ <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>https://www.gartner.com/binaries/content/assets/events/keywords/catalyst/catus8/2017_planning_guide_for_data_analytics.pdf <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 07 Dec 2017 00:00:00 +0000
http://appsilondatascience.com/blog/business/2017/12/07/successful-advanced-analytics-department.html
/blog/business/2017/12/07/successful-advanced-analytics-department.htmlbusinessAn introduction to Monte Carlo Tree Search<h2 id="introduction">Introduction</h2>
<p>We recently witnessed one of the biggest game AI events in history – AlphaGo became the first computer program to beat the world champion in a game of Go. The publication can be found <a href="https://www.nature.com/articles/nature24270">here</a>. Different techniques from machine learning and tree search have been combined by developers from DeepMind to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind the MCTS algorithm and briefly tell you about how it was used at the European Space Agency for planning interplanetary flights.</p>
<h2 id="perfect-information-games">Perfect Information Games</h2>
<p>Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all events that have previously taken place. Examples of such games are Chess, Go or Tic-Tac-Toe. But just because every move is known doesn’t mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over <script type="math/tex">10^{170}</script>.
<label for="‘source" class="margin-toggle" style="font-size: 0.8em; text-decoration: underline"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input type="checkbox" id="‘source" class="margin-toggle" /><span class="marginnote"><a href="https://tromp.github.io/go/gostate.pdf"><em>Source</em></a> </span></p>
<p>Every perfect information game can be represented in the form of a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be 16 white pieces and 16 black pieces placed in the proper places on the chessboard. For Tic-Tac-Toe it would simply be an empty 3x3 matrix. The first player has some number <script type="math/tex">n_1</script> of possible choices to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of the <script type="math/tex">n_1</script> children, the next player has <script type="math/tex">n_2</script> possible moves to consider, each of them generating another state of the game, that is, a child node. Note that <script type="math/tex">n_2</script> might differ for each of the <script type="math/tex">n_1</script> nodes. For instance, in chess you might make a move which forces your opponent to move their king, or consider another move which leaves your opponent with many other options.</p>
<p>An outcome of a play is a path from the root node to one of the leaves. Each leaf contains definite information about which player (or players) won and which lost the game.</p>
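<p>The first two levels of the Tic-Tac-Toe tree described above can be sketched in a few lines of R. This is only an illustration under our own assumptions: the board is stored as a 3x3 character matrix with <code class="highlighter-rouge">""</code> for an empty cell, and <code class="highlighter-rouge">child_states()</code> is a hypothetical helper name, not part of any package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Root node: an empty 3x3 board, "" marks a free cell
empty_board <- matrix("", nrow = 3, ncol = 3)

# Return all states reachable from `board` by one move of `player`
child_states <- function(board, player) {
  free_cells <- which(board == "")
  lapply(free_cells, function(cell) {
    next_board <- board
    next_board[cell] <- player
    next_board
  })
}

# Children of the root: the first player draws a circle
root_children <- child_states(empty_board, "O")
length(root_children)   # 9

# Children of the first child: the second player draws a cross
grandchildren <- child_states(root_children[[1]], "X")
length(grandchildren)   # 8</code></pre></figure>
<p>Here <script type="math/tex">n_1 = 9</script> and <script type="math/tex">n_2 = 8</script> for every child; in games like Chess the number of children varies from node to node.</p>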
<h2 id="making-a-decision-based-on-a-tree">Making a decision based on a tree</h2>
<p>There are two main problems we face when making a decision in a perfect information game. The first, and main one, is the size of the tree.</p>
<p>This doesn’t bother us with very limited games such as Tic-Tac-Toe. We have at most 9 child nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It’s a completely different story with Chess or Go. Here the corresponding tree is so huge that you cannot hope to search the entire tree. The way to approach this is to do a random walk on the tree for some time and get a subtree of the original decision tree.</p>
<p>This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don’t care at all about the efficiency of our moves and do not learn from our previous games. Anyone who has played Chess knows that making random moves on a chessboard won’t get them very far. It might be good for a beginner to get an understanding of how the pieces move. But game after game it’s better to learn how to distinguish good moves from bad ones.</p>
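<p>A purely random walk down the Tic-Tac-Toe tree can be sketched as follows. The helpers <code class="highlighter-rouge">winner()</code> and <code class="highlighter-rouge">random_playout()</code> are hypothetical names introduced only for this illustration; the board is assumed to be a 3x3 character matrix with <code class="highlighter-rouge">""</code> for empty cells.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)

# Winner of a board: "O", "X", or "" if nobody has three in a row
winner <- function(board) {
  lines <- list(
    board[1, ], board[2, ], board[3, ],     # rows
    board[, 1], board[, 2], board[, 3],     # columns
    board[c(1, 5, 9)], board[c(3, 5, 7)]    # diagonals
  )
  for (l in lines) {
    if (l[1] != "" && all(l == l[1])) return(l[1])
  }
  ""
}

# Walk from `board` to a leaf, picking uniformly among legal moves
random_playout <- function(board, player) {
  repeat {
    w <- winner(board)
    free_cells <- which(board == "")
    if (w != "" || length(free_cells) == 0) return(w)
    # Index into free_cells: sample(x, 1) on a single number
    # would sample from 1:x instead
    cell <- free_cells[sample.int(length(free_cells), 1)]
    board[cell] <- player
    player <- if (player == "O") "X" else "O"
  }
}

# Play 1000 random games from the root; "" denotes a draw
results <- replicate(1000, random_playout(matrix("", 3, 3), "O"))
table(results)</code></pre></figure>
<p>Every playout here is independent of the previous ones, which is exactly the weakness described above: nothing learned in one game is used in the next.</p>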
<p>So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.</p>
<h2 id="multi-armed-bandit-problem">Multi-Armed Bandit Problem</h2>
<p>Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn’t compare yourself to the other guy; it’s just luck. But still, it’s normal to ask whether you can do better next time. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?</p>
<p>The problem you are facing is the Multi-Armed Bandit Problem. It was already known during World War II, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are <script type="math/tex">N</script> slot machines, each one with a different expected return value (what you expect to net from a given machine). You don’t know the expected return value of any machine. You are allowed to change machines at any time and play on each machine as many times as you’d like. What is the optimal strategy for playing?</p>
<p>What does “optimal” mean in this scenario? Clearly your best option would be to play only on the machine with the highest expected return value. An optimal strategy is a strategy for which you do as well as possible compared to the best machine.</p>
<p>It was <a href="https://ac.els-cdn.com/0196885885900028/1-s2.0-0196885885900028-main.pdf?_tid=bebd97a8-bda1-11e7-8cfc-00000aab0f6c&acdnat=1509388984_087d17f273327115f07d7cff1d0d294b">actually proven</a> that your expected loss compared to always playing the best machine (the regret) cannot grow more slowly than <script type="math/tex">O( \ln n )</script> on average, where <script type="math/tex">n</script> is the number of plays. So that’s the best you can hope for. Luckily, it was also proven that you can achieve this bound (again, on average). One way to do this is the following.</p>
<p><label for="‘source" class="margin-toggle" style="font-size: 0.8em; text-decoration: underline"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input type="checkbox" id="‘source" class="margin-toggle" /><span class="marginnote">Read <a href="http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf">this paper</a> if you are interested in the proof. </span></p>
<p>For each machine <script type="math/tex">i</script> we keep track of two things: how many times we have tried this machine (<script type="math/tex">n_i</script>) and what the mean return value (<script type="math/tex">x_i</script>) was. We also keep track of how many times (<script type="math/tex">n</script>) we have played in general. Then for each <script type="math/tex">i</script> we compute the confidence interval around <script type="math/tex">x_i</script>:</p>
<script type="math/tex; mode=display">x_i \pm \sqrt{ 2 \cdot \frac{\ln n}{n_i} }</script>
<p>At every step we choose to play on the machine with the highest upper bound for <script type="math/tex">x_i</script> (so “+” in the formula above).</p>
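<p>The selection rule can be written directly in R. The numbers below are made up purely for illustration (three machines with hypothetical play counts and mean returns); only the formula itself comes from the text above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical statistics for three machines
plays       <- c(10, 20, 5)           # n_i: times each machine was tried
mean_return <- c(0.50, 0.55, 0.40)    # x_i: mean return of each machine
total_plays <- sum(plays)             # n: total number of plays

# Upper confidence bound: x_i + sqrt(2 * ln(n) / n_i)
upper_bound <- mean_return + sqrt(2 * log(total_plays) / plays)
round(upper_bound, 3)

# Play the machine with the highest upper bound
which.max(upper_bound)   # 3</code></pre></figure>
<p>Note that the third machine is chosen even though it has the lowest mean return so far: it has been tried only 5 times, so its confidence interval is the widest. This is how the rule balances trying under-explored machines against exploiting the ones that look good.</p>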
<p>This is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game. Just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.</p>
<h2 id="real-world-application-example">Real world application example</h2>
<p>MAB algorithms have multiple practical applications in the real world, for example, price engine optimization or finding the best online campaign. Let’s focus on the first one and see how we can implement this in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: <code class="highlighter-rouge">99$</code>, <code class="highlighter-rouge">100$</code>, <code class="highlighter-rouge">115$</code> and <code class="highlighter-rouge">120$</code>. Now you want to test how those prices will perform and which one to choose eventually.
During the first day of your experiment, 4000 people visited your shop when the first price (<code class="highlighter-rouge">99$</code>) was tested and 368 bought the product.
For the rest of the prices we have the following outcome:</p>
<ul>
<li><code class="highlighter-rouge">100$</code> 4060 visits and 355 purchases,</li>
<li><code class="highlighter-rouge">115$</code> 4011 visits and 373 purchases,</li>
<li><code class="highlighter-rouge">120$</code> 4007 visits and 230 purchases.</li>
</ul>
<p>Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">bandit</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">visits_day1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="m">4060</span><span class="p">,</span><span class="w"> </span><span class="m">4011</span><span class="p">,</span><span class="w"> </span><span class="m">4007</span><span class="p">)</span><span class="w">
</span><span class="n">purchase_day1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">368</span><span class="p">,</span><span class="w"> </span><span class="m">355</span><span class="p">,</span><span class="w"> </span><span class="m">373</span><span class="p">,</span><span class="w"> </span><span class="m">230</span><span class="p">)</span><span class="w">
</span><span class="n">prices</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">99</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">115</span><span class="p">,</span><span class="w"> </span><span class="m">120</span><span class="p">)</span><span class="w">
</span><span class="n">post_distribution</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day1</span><span class="p">,</span><span class="w"> </span><span class="n">visits_day1</span><span class="p">,</span><span class="w"> </span><span class="n">ndraws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w">
</span><span class="n">probability_winning</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prices</span><span class="w">
</span><span class="n">probability_winning</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 99 100 115 120
## 0.3960 0.0936 0.5104 0.0000</code></pre></figure>
<p>We calculated the Bayesian probability that each price performed the best and can see that the price <code class="highlighter-rouge">115$</code> has the highest probability (about 0.51). On the other hand, <code class="highlighter-rouge">120$</code> seems a bit too much for the customers.</p>
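<p>If you are curious what such a probability means, the base R sketch below estimates it by hand without the <code class="highlighter-rouge">bandit</code> package. It assumes a uniform Beta(1, 1) prior on every conversion rate; whether the package follows exactly the same recipe internally is an assumption on my side, so expect numbers that are close to the ones above rather than identical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)
visits    <- c(4000, 4060, 4011, 4007)
purchases <- c(368, 355, 373, 230)
prices    <- c(99, 100, 115, 120)
ndraws    <- 10000

# One column of posterior draws per price.
# Assumed model: Beta(1 + purchases, 1 + visits - purchases)
draws <- sapply(seq_along(visits), function(i) {
  rbeta(ndraws, 1 + purchases[i], 1 + visits[i] - purchases[i])
})

# For every draw, find the price with the highest sampled conversion rate
winner <- apply(draws, 1, which.max)

# Share of draws won by each price
prob_best <- tabulate(winner, nbins = length(prices)) / ndraws
names(prob_best) <- prices
prob_best</code></pre></figure>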
<p>The experiment continues for a few more days.</p>
<p>Day 2 results (the counts are cumulative, so they include day 1):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">visits_day2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">8030</span><span class="p">,</span><span class="w"> </span><span class="m">8060</span><span class="p">,</span><span class="w"> </span><span class="m">8027</span><span class="p">,</span><span class="w"> </span><span class="m">8037</span><span class="p">)</span><span class="w">
</span><span class="n">purchase_day2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">769</span><span class="p">,</span><span class="w"> </span><span class="m">735</span><span class="p">,</span><span class="w"> </span><span class="m">786</span><span class="p">,</span><span class="w"> </span><span class="m">420</span><span class="p">)</span><span class="w">
</span><span class="n">post_distribution</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day2</span><span class="p">,</span><span class="w"> </span><span class="n">visits_day2</span><span class="p">,</span><span class="w"> </span><span class="n">ndraws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000000</span><span class="p">)</span><span class="w">
</span><span class="n">probability_winning</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prices</span><span class="w">
</span><span class="n">probability_winning</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 99 100 115 120
## 0.308623 0.034632 0.656745 0.000000</code></pre></figure>
<p>After the second day, price <code class="highlighter-rouge">115$</code> still shows the best results, with <code class="highlighter-rouge">99$</code> and <code class="highlighter-rouge">100$</code> performing very similarly.</p>
<p>Using the <code class="highlighter-rouge">bandit</code> package we can also perform a significance analysis, which is handy for an overall comparison of proportions using <code class="highlighter-rouge">prop.test</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">significance_analysis</span><span class="p">(</span><span class="n">purchase_day2</span><span class="p">,</span><span class="w"> </span><span class="n">visits_day2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## successes totals estimated_proportion lower upper
## 1 769 8030 0.09576588 -0.004545319 0.01369494
## 2 735 8060 0.09119107 0.030860453 0.04700507
## 3 786 8027 0.09791952 -0.007119595 0.01142688
## 4 420 8037 0.05225831 NA NA
## significance rank best p_best
## 1 3.322143e-01 2 1 3.086709e-01
## 2 1.437347e-21 3 1 2.340515e-06
## 3 6.637812e-01 1 1 6.564434e-01
## 4 NA 4 0 1.548068e-39</code></pre></figure>
<p>At this point we can see that <code class="highlighter-rouge">120$</code> is still performing badly, so we drop it from the experiment and continue for the next day. According to <code class="highlighter-rouge">p_best</code>, the chance that this alternative is the best one is negligible.</p>
<p>Day 3 results:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">visits_day3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">15684</span><span class="p">,</span><span class="w"> </span><span class="m">15690</span><span class="p">,</span><span class="w"> </span><span class="m">15672</span><span class="p">,</span><span class="w"> </span><span class="m">8037</span><span class="p">)</span><span class="w">
</span><span class="n">purchase_day3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1433</span><span class="p">,</span><span class="w"> </span><span class="m">1440</span><span class="p">,</span><span class="w"> </span><span class="m">1495</span><span class="p">,</span><span class="w"> </span><span class="m">420</span><span class="p">)</span><span class="w">
</span><span class="n">post_distribution</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day3</span><span class="p">,</span><span class="w"> </span><span class="n">visits_day3</span><span class="p">,</span><span class="w"> </span><span class="n">ndraws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000000</span><span class="p">)</span><span class="w">
</span><span class="n">probability_winning</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prices</span><span class="w">
</span><span class="n">probability_winning</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 99 100 115 120
## 0.087200 0.115522 0.797278 0.000000</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">vr_draws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value_remaining</span><span class="p">(</span><span class="n">purchase_day3</span><span class="p">,</span><span class="w"> </span><span class="n">visits_day3</span><span class="p">)</span><span class="w">
</span><span class="n">potential_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">vr_draws</span><span class="p">,</span><span class="w"> </span><span class="m">0.95</span><span class="p">)</span><span class="w">
</span><span class="n">potential_value</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## 95%
## 0.02670002</code></pre></figure>
<p>Day 3 results led us to conclude that <code class="highlighter-rouge">115$</code> will generate the highest conversion rate and revenue.
We are still unsure about the conversion probability for the best price <code class="highlighter-rouge">115$</code>, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of value remaining).</p>
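<p>The revenue part of that conclusion can be checked with simple arithmetic. The sketch below assumes that revenue per visitor is just the price multiplied by the observed conversion rate, ignoring costs and anything else that matters in a real shop; the variable names are mine.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">prices    <- c(99, 100, 115, 120)
visits    <- c(15684, 15690, 15672, 8037)
purchases <- c(1433, 1440, 1495, 420)

conversion_rate     <- purchases / visits
revenue_per_visitor <- prices * conversion_rate   # expected revenue from one visit

round(data.frame(price = prices, conversion_rate, revenue_per_visitor), 4)</code></pre></figure>
<p>With these numbers <code class="highlighter-rouge">115$</code> comes out on top on both measures: it brings in roughly 11 dollars per visit, compared with about 9 dollars for <code class="highlighter-rouge">99$</code> and <code class="highlighter-rouge">100$</code>.</p>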
<p>The histograms below show what happens to the value-remaining distribution (the distribution of improvement amounts that another price might have over the current best price) as the experiment continues. With the larger sample we are much more confident about the conversion rate. Over time the other prices have lower chances of beating price <code class="highlighter-rouge">115$</code>.</p>
<p><img src="/blog/figs/2017-11-29-monte-carlo-tree-search/unnamed-chunk-5-1.png" alt="" /></p>
<p>If this example was interesting to you, check out our other <a href="http://appsilondatascience.com/blog/business/2017/11/14/dynamic-pricing.html">post about dynamic pricing</a>.</p>
<h2 id="monte-carlo-tree-search">Monte Carlo Tree Search</h2>
<p>We are ready to learn how the Monte Carlo Tree Search algorithm works.</p>
<p>As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would when solving the Multi-Armed Bandit Problem. This can be done when we have some information about the pay-outs for each child node.</p>
<p><img src="/blog/assets/article_images/2017-11-23-monte-carlo-tree-search/multi-tree.jpeg" alt="At the first node it's black's turn. The node with highest upper bound is chosen. Then it's white's turn and again the node with the highest upper bound is chosen." /></p>
<p>At some point we reach a node where we can no longer proceed in this manner because there is at least one child node with no statistics yet. It’s time to explore the tree to get new information. This can be done either completely randomly or by applying some heuristic methods when choosing child nodes (in practice this might be necessary for games with a high branching factor, like chess or Go, if we want to achieve good results).</p>
<p><img src="/blog/assets/article_images/2017-11-23-monte-carlo-tree-search/explore-tree.jpeg" alt="At some point we cannot apply Multi-Armed Bandit procedure, because there is a node which has no stats. We explore new part of the tree." /></p>
<p>Finally, we arrive at a leaf. Here we can check whether we have won or lost.</p>
<p><img src="/blog/assets/article_images/2017-11-23-monte-carlo-tree-search/new-leaf-tree.jpeg" alt="We arrive at a leaf. This determines the outcome of the game." /></p>
<p>It’s time to update the nodes we have visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner, we increase the number of wins by one. Otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we can automatically deduce it from the number of losses and wins).</p>
<p><img src="/blog/assets/article_images/2017-11-23-monte-carlo-tree-search/updated-tree.jpeg" alt="It's time to update the tree." /></p>
<p>That’s it! We repeat this process until some condition is met: a timeout is reached or the confidence intervals we mentioned earlier stabilize (convergence). Then we make our final decision based on the information we have gathered during the search. We can choose the node with the highest upper bound for pay-out (as we would in each iteration of the Multi-Armed Bandit). Or, if you prefer, choose the one with the highest mean pay-out.</p>
<p>The decision is made, a move was chosen. Now it’s time for our opponent (or opponents). When they’ve finished we arrive at a new node, somewhere deeper in the tree, and the story repeats.</p>
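<p>The whole loop fits in a short base R sketch. Everything specific in it is an assumption made for illustration: the toy game (a pile of stones, players alternately take 1 to 3, whoever takes the last stone wins), the function names, and the choice to add one new node per iteration and finish the game with random moves. It is one common way to organise the search, not the only one.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">legal_moves <- function(stones) seq_len(min(3, stones))

new_node <- function(stones, player, parent = NULL) {
  node <- new.env(parent = emptyenv())
  node$stones   <- stones   # stones left in this position
  node$player   <- player   # player (1 or 2) whose move led to this node
  node$parent   <- parent
  node$children <- list()
  node$wins     <- 0        # playouts through this node won by node$player
  node$visits   <- 0
  node
}

# Upper bound from the Multi-Armed Bandit formula
ucb <- function(child, n_parent) {
  child$wins / child$visits + sqrt(2 * log(n_parent) / child$visits)
}

mcts_iteration <- function(root) {
  node <- root
  # 1. While every child has statistics, follow the highest upper bound
  repeat {
    if (node$stones == 0) break
    if (length(node$children) < length(legal_moves(node$stones))) break
    scores <- vapply(node$children, ucb, numeric(1), n_parent = node$visits)
    node <- node$children[[which.max(scores)]]
  }
  # 2. Explore: add the next child that has no statistics yet
  if (node$stones > 0) {
    move  <- length(node$children) + 1
    child <- new_node(node$stones - move, 3 - node$player, parent = node)
    node$children[[move]] <- child
    node <- child
  }
  # 3. Play random moves until the game ends (we arrive at a leaf)
  stones <- node$stones
  last_mover <- node$player
  while (stones > 0) {
    stones <- stones - sample.int(min(3, stones), 1)
    last_mover <- 3 - last_mover
  }
  winner <- last_mover   # whoever took the last stone
  # 4. Update every node on the way back to the root
  while (!is.null(node)) {
    node$visits <- node$visits + 1
    if (node$player == winner) node$wins <- node$wins + 1
    node <- node$parent
  }
  invisible(NULL)
}

set.seed(1)
root <- new_node(stones = 10, player = 2)   # so player 1 moves first
for (i in 1:5000) mcts_iteration(root)

# Final decision: mean pay-out of each possible first move (take 1, 2 or 3)
mean_payout <- vapply(root$children, function(ch) ch$wins / ch$visits, numeric(1))
mean_payout</code></pre></figure>
<p>In this toy game, leaving the opponent a multiple of four stones is a winning strategy, so taking 2 stones from a pile of 10 is the move I would expect the search to favour once it has run for enough iterations.</p>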
<h2 id="not-just-for-games">Not just for games</h2>
<p>As you might have noticed, Monte Carlo Tree Search can be considered a general technique for making decisions in perfect information scenarios. Therefore its use does not have to be restricted to games. The most amazing use case I have heard of is planning interplanetary flights. You can read about it at <a href="http://www.esa.int/gsp/ACT/ai/projects/mcts_traj.html">this website</a> but I will summarize it briefly.</p>
<p>Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars.</p>
<p>An efficient way to do this is to make use of these planets’ gravitational fields (like they did in ‘The Martian’ movie) so you can take less fuel with you. The question is when the best time is to arrive at and leave each planet’s surface or orbit (for the first and last planet it’s only leaving and arriving, respectively).</p>
<p>You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot should I arrive and in which should I leave? Each such choice determines the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, what places in the universe you can reach.</p>
<p>Each such set of consecutive choices determines where you arrive at the end. If you visited all required checkpoints, you’ve won. Otherwise, you’ve lost. It’s like a perfect information game. There is no opponent and you make a move by determining the time slot for leaving or arriving. This can be treated using the above Monte Carlo Tree Search. As you can read <a href="https://fenix.tecnico.ulisboa.pt/downloadFile/563345090414562/Thesis.pdf">here</a>, it fares quite well in comparison with other known approaches to this problem. At the same time, it does not require any domain-specific knowledge to implement.</p>
<hr />
<p>Write your question and comments below. We’d love to hear what you think.</p>
Wed, 29 Nov 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/11/29/monte-carlo-tree-search.html
/blog/rstats/2017/11/29/monte-carlo-tree-search.htmlmontecarlo,treesearch,ai,machinelearningrstatsA Quick Guide to Dynamic Pricing<p><strong>The price you paid for your plane ticket is not the same as the price paid by the person sitting next to you.</strong></p>
<hr />
<p>You are probably aware of this. Given the time, most of us will research competing rates and try to find the best deal. Travel sites do this for you. What you’ve experienced is called Dynamic Pricing. We’re going to go over what it is, how it works, a few examples and the data involved.</p>
<h2 id="what-is-it">What is it?</h2>
<p>Let’s take a trip to the past. You’ve come to the conclusion that it is time to buy an apartment or build a house. So what do you do? Get a loan. You gather all of your financial records, some recommendations from your neighbors and colleagues, and head down to the bank. You spend 20 minutes talking to the banker, trying to build rapport and to establish your creditworthiness and ability to pay back the loan. As an outstanding member of society with a good job and no blemishes on your record, you expect to be approved without an issue. But you are rejected. And there is a good chance that you ruined your prospects before even opening your mouth. But how?</p>
<p>Your shoes! There was an unwritten rule in banking, that <strong>if you wanted to judge a client’s wealth, just look at his shoes.</strong> You were unaware. You forgot to polish your shoes and even trudged through the mud on your way through town.</p>
<p>Before data-driven decision making, we only had contact with dynamic pricing based on the seller’s emotions and intuition. The bank determined that they would not make money on you, and decided that the risk was too high. Stereotyping of this sort can be identified as ‘click, whir’ behavior. This involves taking mental shortcuts when making a decision and eschewing complete analysis. Read more on this in Cialdini’s book <a href="https://www.goodreads.com/book/show/28815.Influence">Influence: The Psychology of Persuasion</a>.</p>
<p><strong>Dynamic pricing is a method of maximising profit margins by basing prices on the customer’s propensity to pay, taking into account a gamut of data points.</strong> There is not a single way to go about this, but rather, a realm of possibilities that can be considered depending on the industry, data, and desired results. Industries include travel, entertainment, hospitality, retail and even finance. You can find applications in many more industries. These are all examples where demand can be actively monitored, correlated with historical data and used to increase margins. Prices change according to demand, but also according to predicted demand. AI, or more precisely, machine learning is used for the most advanced cases.</p>
<p>From an economics perspective, this is a simple concept. Think back to Econ 101; what was the first thing the Professor presented? It was probably Supply and Demand. The basic premise is that price is determined by the market. More precisely, competing forces find a balance between consumers’ willingness to pay and the supplier’s efficiency in providing goods and services. Both parties should feel satisfied with the price. Theory, however, can only tell you so much about reality. The supply and demand curve shows that an increase in price, except for Veblen goods, results in decreased demand. There are still clients at this higher rate. Calculating the chances a buyer will forego a purchase because of a rate increase allows sellers to adapt the price to each individual customer. After all, businesses want to maximize their profits.</p>
<p>Appsilon has worked on projects like this, namely for B2B clients. A particular application that stands out was for a very large equipment manufacturer. Prices were based not only on the client’s history, but also on the order size and the predicted lifetime value.</p>
<h2 id="transcending-time">Transcending time</h2>
<p>You may have recently purchased a new vehicle. Unless it was a Tesla, you negotiated on the price. This is a prime example of price elasticity, where both parties actively find a price equilibrium. Both sides want to come to a price that benefits them. The salesman knows how much the car must go for in order to make a profit. An educated buyer can see the MSRP (Manufacturer’s Suggested Retail Price) on the window, but also knows what the average price paid was. After some back and forth, a price is found that satisfies everyone.</p>
<p>This has been going on for decades, but online shopping looks different than it did ten years ago. The quote you see on a vehicle or advertisement varies widely. You are no longer a hard working engineer who was marginalized because they walked through a puddle. <strong>Today, the price you are quoted is based on data such as your willingness to pay, your preferences, financial status and more.</strong></p>
<h2 id="but-how">But how…?</h2>
<p>The advent of social media has led to an increase in attention directed at oneself. Millennials have even been called the ‘Me’ generation. Customers, young and old, have begun to expect highly personalized experiences. Advertisers have been trying to find out who their material is being shown to so they could measure retention and ROI (Return on Investment). Billboards target a certain region, taking into account the traffic that drives by. You can intuitively understand that unless you are selling a utility or public service, this approach brings little more than empty talk. Internet advertisers, especially those who use social networks in combination with their data, can create user segments made up of one person. <strong>You are your own segment.</strong> Your preferences, likes, desires, behavior, and socioeconomic information are used to create a refined profile. You are shown products that reflect your needs and wants, and not those of your neighbor.</p>
<p>Being offered a product that you actually want is nice. You’ll even be pleasantly surprised sometimes. It is only logical to take the data collected and apply it to prices. Herein lies the secret sauce of dynamic pricing. Machine learning (and deep learning) or AI, as portrayed in the media, is only as useful as the data it feeds on. As such, fine-grained data make for refined insights. Such data can be used to train a model that helps fill in the missing information or behavior of customers who are less active on social media.</p>
<p>Online shopping has never been the same. Not only are the product recommendations personalized just to your liking but also the prices. Customers and businesses rejoice. Everyone gets what they want.</p>
<h2 id="tell-me-about-the-data">Tell me about the data</h2>
<p>AI works by leveraging large data sets to train models. Training is the process of feeding examples of the outcome we would like until the model begins to recognize it. This is done by varying the statistical weights. Neural networks are frequently used among other methods, such as tree-based and gradient boosting approaches. They are essentially layers of abstraction that are useful for a few reasons. They can speed up processing of large data sets by representing data as an abstraction. They can be trained to learn based on experience and update themselves over time. There are ways to add algorithmic memory to include context in our process. There are three main fields where all of this takes place: data science, machine learning and deep learning. Data science encompasses the other two, as well as many other statistical and analytic methods. Deep learning is a specific subset of machine learning that is currently the most popular method in image, speech and text recognition, as it has surpassed human-level performance in a multitude of tasks.</p>
<h2 id="makes-sense-but-why">Makes sense, but why?</h2>
<p>Why would a company charge you more? Well, for starters, because you can afford it. There is a maximum price we will pay for a given product. It varies based on the time and place, but also on our needs and, more recently, our desires. Another reason for such changes in price, namely discounted prices, is to increase customer retention and loyalty. Lastly, certain businesses have sunk costs that they would rather cover. It’s better not to lose money than to make money. <strong>Let us know if you think your business could benefit from dynamic pricing and make sure you aren’t leaving money on the table. </strong></p>
Tue, 14 Nov 2017 00:00:00 +0000
http://appsilondatascience.com/blog/business/2017/11/14/dynamic-pricing.html
/blog/business/2017/11/14/dynamic-pricing.htmleconomy,price,pricing,dynamicpricing,custombusinessAn example of how to use the new R promises package<p><strong>The long-awaited promises will be released soon!</strong></p>
<p><img src="/blog/assets/article_images/2017-11-01-r-promises-hands-on/yay.gif" alt="Can't wait to play with promises :)" /></p>
<p>Being as impatient as I am when it comes to new technology, I decided to play with the currently available implementation of promises that Joe Cheng shared and presented recently in London at the <a href="https://earlconf.com/london/">EARL conference</a>.</p>
<p>From this article you’ll get to know the upcoming <code class="highlighter-rouge">promises</code> package, how to use it and how it is different from the already existing <code class="highlighter-rouge">future</code> package.</p>
<p>Promises/Futures are a concept used in almost every major programming language. We’ve used Tasks in C#, Futures in Scala, Promises in JavaScript and they all adhere to a common understanding of what a promise is.</p>
<p>If you are not familiar with the concept of Promises, asynchronous tasks or Futures, I advise you to take a longer moment and dive into the topic. If you’d like to dive deeper and achieve a higher level of understanding, read about Continuation Monads in Haskell. We’ll be comparing the new <code class="highlighter-rouge">promises</code> package with the <code class="highlighter-rouge">future</code> package, which has been around for a while, so I suggest you take a look at it first (<a href="https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html">https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html</a>) if you haven’t used it before.</p>
<p>Citing Joe Cheng, our aim is to:</p>
<ol>
<li>Execute long-running code asynchronously on a separate thread.</li>
<li>Be able to do something with the result (if success) or error (if failure), when the task completes, back on the main R thread.</li>
</ol>
<p>A promise object represents the eventual result of an async task. A promise is an R6 object that knows:</p>
<ol>
<li>Whether the task is running, succeeded, or failed</li>
<li>The result (if succeeded) or error (if failed)</li>
</ol>
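<p>To picture what such an object holds, here is a toy sketch in base R. It is not the <code class="highlighter-rouge">promises</code> package API and it is not even asynchronous: <code class="highlighter-rouge">toy_promise</code> and <code class="highlighter-rouge">run_task</code> are hypothetical helpers that only illustrate the two pieces of knowledge listed above, the state and the result or error.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy illustration only: NOT the promises package, and it runs synchronously.
toy_promise <- function() {
  self <- new.env()
  self$state <- "running"   # "running", "succeeded" or "failed"
  self$value <- NULL        # the result (if succeeded) or the error (if failed)
  self$resolve <- function(result) {
    self$state <- "succeeded"
    self$value <- result
  }
  self$reject <- function(error) {
    self$state <- "failed"
    self$value <- error
  }
  self
}

# Run a task and record its outcome in a toy promise
run_task <- function(task) {
  p <- toy_promise()
  tryCatch({
    result <- task()   # evaluate first, so a failure never counts as a success
    p$resolve(result)
  }, error = function(e) p$reject(e))
  p
}

ok  <- run_task(function() 1 + 1)
bad <- run_task(function() stop("boom"))

ok$state                      # "succeeded"
ok$value                      # 2
bad$state                     # "failed"
conditionMessage(bad$value)   # "boom"</code></pre></figure>
<p>A real promise differs in the important part: the task runs elsewhere, the object stays in the running state in the meantime, and the main R thread is free to do other work until the result or error arrives.</p>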
<hr />
<p><strong>Without further ado, let’s get our hands on the code!</strong> You should be able to just copy-paste code into RStudio and run it.</p>
<p>R is single-threaded. This means that a user cannot interact with your Shiny app if there is a long-running task being executed on the server. Let’s take a look at an example:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 10s
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w">
</span><span class="n">sumAC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">sumBC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 15s
</span><span class="n">print</span><span class="p">(</span><span class="n">sumAC</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sumBC</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 15s</span></code></pre></figure>
<p>We’ll use a simplified version of user interaction while there are some additional computations happening on the server. Let’s assume that we can’t just put all the computations in a separate block of code and just run it separately using the <code class="highlighter-rouge">future</code> package. There are many cases when it is very difficult or even almost impossible to just gather all computations and run them elsewhere as one big long block of code.</p>
<p>The user cannot interact with the app for 10 seconds until the computations are finished, and then has to wait another 5 seconds for the next interaction. This is not where we want to be. User interactions should be as fast as possible and the user shouldn’t have to wait if it is not required. Let’s fix that using the R <code class="highlighter-rouge">future</code> package that we already know.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">install.packages</span><span class="p">(</span><span class="s2">"future"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">10</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">a</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">b</span><span class="p">))</span><span class="w">
</span><span class="n">sumAC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">sumBC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 5s
</span><span class="n">print</span><span class="p">(</span><span class="n">sumAC</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sumBC</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 5s</span></code></pre></figure>
<p>Nice, now the first user interaction can happen in parallel! But the second interaction is still blocked: we have to wait for the values before we can print their sum. To fix that, we’d like to chain the computation into the summing function instead of waiting synchronously for the result. We can’t do that using pure futures though (assuming we can’t just put all these computations in one single block of code and run it in parallel). Ideally we’d like to be able to write code similar to the one below:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">10</span><span class="p">))</span><span class="w">
</span><span class="n">future</span><span class="p">(</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">a</span><span class="p">)))</span><span class="w">
</span><span class="n">future</span><span class="p">(</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">b</span><span class="p">)))</span><span class="w">
</span><span class="n">sumAC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">c</span><span class="p">))</span><span class="w">
</span><span class="n">sumBC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">c</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s
</span><span class="n">future</span><span class="p">(</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">sumAC</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="p">(</span><span class="n">sumBC</span><span class="p">)))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s</span></code></pre></figure>
<p>Unfortunately, the <code class="highlighter-rouge">future</code> package won’t allow us to do that.</p>
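<p>The closest plain futures get to this is polling: <code class="highlighter-rouge">resolved()</code> reports whether a future has finished without blocking. Below is a minimal sketch of that idea. It assumes a <code class="highlighter-rouge">multisession</code> plan in place of <code class="highlighter-rouge">multiprocess</code>, and <code class="highlighter-rouge">slowValue</code> is a hypothetical helper. It keeps the session responsive, but the chaining is still left to us:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(future)
plan(multisession)  # assumption: stands in for the plan(multiprocess) used above

# Hypothetical helper: waits two seconds, then returns its input.
slowValue <- function(value) {
  Sys.sleep(2)
  value
}

a <- future(slowValue(1))

# resolved() returns TRUE or FALSE right away; it does not wait for the worker.
while (!resolved(a)) {
  print("User interaction")  # the main session is free to do other work
  Sys.sleep(0.5)
}

# The future has finished, so value() returns without waiting.
print(value(a))
</code></pre></figure>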
<hr />
<p>What we can do is use the <a href="https://github.com/rstudio/promises">promises</a> package from RStudio!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"rstudio/promises"</span><span class="p">)</span></code></pre></figure>
<hr />
<p><strong>Let’s play with the promises!</strong> I simplified our example to let us focus on using promises first:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="n">tibble</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">value</span><span class="p">(</span><span class="n">a</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 5s</span></code></pre></figure>
<p>We’d like to chain the result of <code class="highlighter-rouge">longRunningFunction</code> to a print function so that once the <code class="highlighter-rouge">longRunningFunction</code> is finished, its results are printed.</p>
<p>We can achieve that by using the <code class="highlighter-rouge">%...>%</code> operator. It works like the very popular <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html">%>% operator from magrittr</a>. Think of <code class="highlighter-rouge">%...>%</code> as “sometime in the future, once I have the result of the operation, pass the result to the next function”. The three dots symbolise that we have to wait: the result will be passed on in the future, not now.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">promises</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="n">tibble</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w"> </span><span class="c1"># Time: 5s
</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s</span></code></pre></figure>
<p>Pure magic.</p>
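<p>The magic is less mysterious once you see the function form. As far as I can tell, the pipe is shorthand for registering a callback with <code class="highlighter-rouge">then()</code>. The sketch below uses <code class="highlighter-rouge">promise_resolve()</code> to create an already-fulfilled promise so that no worker process is needed, and it drains the event loop by hand with the <code class="highlighter-rouge">later</code> package, which is only necessary in a plain script:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(promises)

# An already-fulfilled promise, standing in for a finished future.
p <- promise_resolve(42)

# Pipe form, as used above.
p %...>% print()

# Function form: register a callback that receives the fulfilled value.
then(p, onFulfilled = function(value) {
  print(value)
})

# Callbacks run on the event loop. An interactive console or a Shiny session
# runs the loop on its own; a plain script has to drain it explicitly.
while (!later::loop_empty()) later::run_now()
</code></pre></figure>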
<p>But what if I want to filter the result first and then print the processed data? Just keep on chaining:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">promises</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="n">tibble</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span></code></pre></figure>
<p>Neat. But how can I print the result of filtering and still pass it to the <code class="highlighter-rouge">sum</code> function? There is a tee operator, <code class="highlighter-rouge">%...T>%</code>, the same as the one magrittr provides (but one that operates on a promise). It calls the function on its right-hand side with the incoming value but ignores what that function returns. If you chain further, the next function receives the value from before the tee, not the result of <code class="highlighter-rouge">print()</code>. Think of it as a railway switch: the value is printed on a side track, which ends there, and the run continues on the main track:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">promises</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="n">tibble</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%...T>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span></code></pre></figure>
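<p>Because <code class="highlighter-rouge">print()</code> returns its argument, the example above would give the same sum even with a regular <code class="highlighter-rouge">%...>%</code>. The difference shows up when the side-track function returns something else. Here is a small sketch with two hypothetical helpers, <code class="highlighter-rouge">logValues</code> and <code class="highlighter-rouge">storeTotal</code>, and an already-fulfilled promise in place of a future:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(promises)

# Hypothetical side-track function: logs its input, returns something unrelated.
logValues <- function(x) {
  cat("filtered:", x, "\n")
  "a return value that the tee discards"
}

# Hypothetical helper that keeps the final result for inspection.
total <- NULL
storeTotal <- function(s) {
  total <<- s
}

promise_resolve(c(1, 3, 5)) %...T>%
  logValues() %...>%   # side track: its return value is dropped
  sum() %...>%         # main track: still receives c(1, 3, 5)
  storeTotal()

# Only needed in a plain script; a console or Shiny runs the event loop itself.
while (!later::loop_empty()) later::run_now()
print(total)
</code></pre></figure>
<p>With a regular <code class="highlighter-rouge">%...>%</code> in place of the tee, <code class="highlighter-rouge">sum()</code> would receive the character string and fail.</p>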
<p>What about errors? They are thrown somewhere other than the main thread, so how can I catch them? You guessed it - there is an operator for that as well. Use <code class="highlighter-rouge">%...!%</code> to handle errors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">promises</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tibble</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"ERROR"</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="n">tibble</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">number</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%...T>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="nf">sum</span><span class="p">()</span><span class="w"> </span><span class="o">%...>%</span><span class="w">
</span><span class="n">print</span><span class="p">()</span><span class="w"> </span><span class="o">%...!%</span><span class="w">
</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">error</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Unexpected error:"</span><span class="p">,</span><span class="w"> </span><span class="n">error</span><span class="o">$</span><span class="n">message</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span></code></pre></figure>
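<p>An error skips every remaining step of the chain and lands in the handler. The sketch below makes that visible using the function form <code class="highlighter-rouge">catch()</code>, which I understand to be the counterpart of <code class="highlighter-rouge">%...!%</code>. The helpers <code class="highlighter-rouge">failingStep</code> and <code class="highlighter-rouge">markNotSkipped</code> are hypothetical, and an already-fulfilled promise again replaces the future:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(promises)

# Hypothetical step that always fails.
failingStep <- function(value) {
  stop("ERROR")
}

# Hypothetical step that records whether it was ever called.
skipped <- TRUE
markNotSkipped <- function(x) {
  skipped <<- FALSE
  x
}

chain <- promise_resolve(1) %...>%
  failingStep() %...>%
  markNotSkipped()   # never runs, because the previous step failed

# Function form of the error handler.
caught <- NULL
catch(chain, function(error) {
  caught <<- conditionMessage(error)
})

# Only needed in a plain script; a console or Shiny runs the event loop itself.
while (!later::loop_empty()) later::run_now()
print(skipped)
print(caught)
</code></pre></figure>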
<p>But in our example we’re not just chaining one computation. There is a <code class="highlighter-rouge">longRunningFunction</code> call that eventually returns 1 and another one that eventually returns 2. We need to somehow join the two: once both of them are ready, we’d like to use them and return the sum. We can use the <code class="highlighter-rouge">promise_all</code> function to achieve that. It takes promises as arguments (or a list of them through its <code class="highlighter-rouge">.list</code> argument) and returns a promise that eventually resolves to a list with the result of each promise.</p>
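<p>Here is a minimal sketch of <code class="highlighter-rouge">promise_all</code> on its own. Already-fulfilled promises stand in for the two futures, and <code class="highlighter-rouge">storeSum</code> is a hypothetical helper that keeps the result so we can inspect it:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">library(promises)

# Already-fulfilled promises standing in for two long-running futures.
a <- promise_resolve(1)
c <- promise_resolve(10)

# Hypothetical helper: `values` is a list with one element per input promise,
# in the order the promises were passed in.
sumAC <- NULL
storeSum <- function(values) {
  sumAC <<- values[[1]] + values[[2]]
}

promise_all(a, c) %...>% storeSum()

# Only needed in a plain script; a console or Shiny runs the event loop itself.
while (!later::loop_empty()) later::run_now()
print(sumAC)
</code></pre></figure>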
<p>Perfect. We know the tools that we can use to chain asynchronous functions. Let’s use them in our example then:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="n">multiprocess</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">promises</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span><span class="w">
</span><span class="n">longRunningFunction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="n">longRunningFunction</span><span class="p">(</span><span class="m">10</span><span class="p">))</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">sumAC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">promise_all</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">reduce</span><span class="p">(</span><span class="n">`+`</span><span class="p">)</span><span class="w">
</span><span class="n">sumBC</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">promise_all</span><span class="p">(</span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">reduce</span><span class="p">(</span><span class="n">`+`</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s
</span><span class="n">promise_all</span><span class="p">(</span><span class="n">sumAC</span><span class="p">,</span><span class="w"> </span><span class="n">sumBC</span><span class="p">)</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">reduce</span><span class="p">(</span><span class="n">`+`</span><span class="p">)</span><span class="w"> </span><span class="o">%...>%</span><span class="w"> </span><span class="n">print</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"User interaction"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time: 0s</span></code></pre></figure>
<p>A task for you - in the line <code class="highlighter-rouge">sumAC <- promise_all(a, c) %...>% reduce(`+`)</code>, print the list of values from promises <code class="highlighter-rouge">a</code> and <code class="highlighter-rouge">c</code> before they are summed up.</p>
<p><strong>A handful of useful notes:</strong></p>
<p>[1] Support for promises has been implemented in Shiny, but neither the CRAN release nor the GitHub master branch includes it yet. Until it is merged, you’ll have to install Shiny from the async branch:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"rstudio/shiny@async"</span><span class="p">)</span></code></pre></figure>
<p>[2] Beta-quality code at <a href="https://github.com/rstudio/promises">https://github.com/rstudio/promises</a></p>
<p>[3] Early drafts of docs temporarily hosted at: <a href="https://medium.com/@joe.cheng">https://medium.com/@joe.cheng</a></p>
<p>[4] Joe Cheng’s talk at EARL 2017 in London - <a href="https://www.dropbox.com/s/2gf6tfk1t345lyf/async-talk-slides.pdf?dl=0">https://www.dropbox.com/s/2gf6tfk1t345lyf/async-talk-slides.pdf?dl=0</a></p>
<p>[5] The plan is to release everything on CRAN by the end of this year.</p>
<p><strong>I hope you have as much fun playing with the promises as I did!</strong> I’m planning to play with Shiny support for promises next.</p>
<hr />
<p>Till next time!</p>
Wed, 01 Nov 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/11/01/r-promises-hands-on.html
shiny r promises future rstudio promise task monads rstats

How we built a Shiny App for 700 users?
<p>One of our senior data scientists, Olga Mierzwa-Sulima, spoke at the useR! conference in Brussels to a packed house. The seats were full and there were audience members spilling out the doors.</p>
<p><img src="/blog/assets/article_images/2017-09-05-scaling-shiny/audience2.png" />
<img src="/blog/assets/article_images/2017-09-05-scaling-shiny/audience.png" />
Source: https://twitter.com/matlabulous/status/882530484374392834</p>
<p>Olga’s talk was entitled <a href="https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/How-we-built-a-Shiny-App-for-700-users">‘How we built a Shiny App for 700 users?’</a> She went over the main challenges associated with scaling a Shiny application, and the methods we used to resolve them. The talk was partly in the form of a case study based on Appsilon’s experience.</p>
<div class="container" style="position: relative; width: 100%; height: 0; padding-bottom: 56.25%; margin: 2em 0;">
<iframe src="https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/How-we-built-a-Shiny-App-for-700-users/player" allowfullscreen="" frameborder="0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe>
</div>
<p>Olga shared her experience from a real-life case study: building an app used daily by 700 users, in which our data science team tackled the problems described below. To our knowledge, this was one of the biggest production deployments of a Shiny app.</p>
<p>Shiny has proved itself a great tool for communicating data science teams’ results. However, developing a Shiny app for a large-scope project that will be used commercially by more than a few dozen users is not easy.</p>
<p>The first challenge is the User Interface (UI): users expect the app to look and behave like a modern web page.</p>
<p>Secondly, performance directly impacts user experience (UX), and it’s difficult to maintain efficiency as the requirements and the user base grow.</p>
<p>She showed an innovative approach to building a beautiful and flexible Shiny UI using the <a href="https://appsilon.github.io/shiny.semantic/"><em>shiny.semantic</em></a> package (an alternative to standard Bootstrap). Furthermore, she talked about non-standard optimization tricks we implemented to gain performance. Then she discussed challenges regarding complex reactivity and offered solutions. Finally, she went through the implementation and deployment process of the app using a load balancer.</p>
<p>Appsilon’s open-source packages allow you to build superior Shiny dashboards. Make sure to check out <a href="https://appsilon.github.io/shiny.semantic/"><em>shiny.semantic</em></a>, <a href="https://appsilon.github.io/shiny.collections/"><em>shiny.collections</em></a>, and <a href="https://appsilon.github.io/shiny.router/"><em>shiny.router</em></a> and follow us on <a href="https://twitter.com/AppsilonDS"><em>Twitter</em></a> to get updates about <a href="https://github.com/Appsilon/shiny.i18n"><em>shiny.i18n</em></a> — our brand new package for internationalization.</p>
Tue, 17 Oct 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/10/17/scaling-shiny.html
/blog/rstats/2017/10/17/scaling-shiny.html
shiny, r, chat, shiny.collections, rstats
How to sample from multidimensional distributions using Gibbs sampling?
<p><strong>We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.</strong></p>
<hr />
<p><small>Random sampling with rabbit on the bed plane</small></p>
<iframe src="https://giphy.com/embed/49euH7z1UKBy0" width="480" height="480" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
<p><a href="https://giphy.com/gifs/rabbit-49euH7z1UKBy0">via GIPHY</a></p>
<hr />
<p><strong>To start, what are MCMC algorithms and what are they based on?</strong></p>
<p>Suppose we want to generate a random variable <script type="math/tex">X</script> with distribution <script type="math/tex">\pi</script> on <script type="math/tex">\mathcal{X}</script>.<br />
If we cannot do this directly, we will be satisfied with generating a sequence of random variables <script type="math/tex">X_0, X_1, \ldots</script> whose distributions converge, in a suitable sense, to <script type="math/tex">\pi</script>. The MCMC approach explains how to do so:</p>
<p>Build a Markov chain <script type="math/tex">X_0, X_1, \ldots</script> on <script type="math/tex">\mathcal{X}</script> whose stationary distribution is <script type="math/tex">\pi</script>.
If the chain is constructed correctly, we should expect the distribution of <script type="math/tex">X_n</script> to converge to <script type="math/tex">\pi</script>.</p>
<p>In addition, we can expect that
for a function <script type="math/tex">f: \mathcal{X} \rightarrow \mathbb{R}</script> the following holds:<br />
<script type="math/tex">\mathbb{E}_{\pi}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i})</script></p>
<p>with probability 1.</p>
<p>Ideally, the above equality holds for an arbitrary starting variable <script type="math/tex">X_0</script>.</p>
<hr />
<p><strong>One of the MCMC algorithms that guarantee the above properties is the so-called Gibbs sampler.</strong></p>
<p><strong>Gibbs Sampler - description of the algorithm.</strong></p>
<p>Assumptions:</p>
<ul>
<li><script type="math/tex">\pi</script> is defined on the product space <script type="math/tex">\mathcal{X} = \prod_{i=1}^{d}\mathcal{X}_{i}</script></li>
<li>We are able to draw from the conditional distributions
<script type="math/tex">\pi(X_{i}|X_{-i})</script>,
where
<script type="math/tex">X_{-i} = (X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{d})</script></li>
</ul>
<p>Algorithm steps:</p>
<ol>
<li>Select the initial values <script type="math/tex">X_{k}^{(0)}, k = 1, \ldots, d</script></li>
<li>For <script type="math/tex">t=1,2,\ldots</script> repeat:
<blockquote>
<p>For <script type="math/tex">k = 1, \ldots, d</script> sample <script type="math/tex">X_{k}^{(t)}</script> from the distribution <script type="math/tex">\pi(X_{k}^{(t)}\rvert X_{1}^{(t)}, \ldots, X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots, X_{d}^{(t-1)})</script></p>
</blockquote>
</li>
<li>Repeat step 2 until the distribution of the vector <script type="math/tex">(X_{1}^{(t)}, \ldots, X_{d}^{(t)})</script> stabilizes.</li>
<li>Subsequent iterations of step 2 then provide draws from <script type="math/tex">\pi</script>.</li>
</ol>
<p>How should we understand step 3?<br />
In practice, we are satisfied when the running averages of the coordinates stabilize to within some chosen accuracy.</p>
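<p>The steps above can be sketched as a small generic function. This is only an illustration under stated assumptions: the names <code>gibbs_sampler</code> and <code>conditionals</code> are hypothetical, and the toy target (a binomial count with a beta-distributed success probability, with <code>n = 16</code>, <code>a = 2</code>, <code>b = 4</code>) is chosen here only because both of its conditional distributions are available in base R.</p>

```r
# Generic systematic-scan Gibbs sampler (illustrative sketch).
# `conditionals` is a list of d functions; the k-th one takes the current
# state vector x and returns one draw of x[k] given all other coordinates.
gibbs_sampler <- function(n_iter, init, conditionals) {
  d <- length(init)
  draws <- matrix(NA_real_, nrow = n_iter + 1, ncol = d)
  draws[1, ] <- init
  x <- init
  for (t in seq_len(n_iter)) {
    for (k in seq_len(d)) {
      # coordinates 1..k-1 already hold iteration-t values,
      # coordinates k+1..d still hold iteration-(t-1) values
      x[k] <- conditionals[[k]](x)
    }
    draws[t + 1, ] <- x
  }
  draws
}

# Toy target (an assumption for this example):
#   x[1] | x[2] ~ Binomial(16, x[2])
#   x[2] | x[1] ~ Beta(x[1] + 2, 16 - x[1] + 4)
set.seed(1)
toy_conditionals <- list(
  function(x) rbinom(1, 16, x[2]),
  function(x) rbeta(1, x[1] + 2, 16 - x[1] + 4)
)
toy_draws <- gibbs_sampler(5000, c(8, 0.5), toy_conditionals)

# Running average of the first coordinate: this is the quantity we watch
# when deciding whether the chain has stabilized (step 3).
running_mean <- cumsum(toy_draws[-1, 1]) / seq_len(5000)
tail(running_mean, 1)
```

<p>For this toy target the marginal mean of the first coordinate is 16 · 2 / (2 + 4) ≈ 5.33, so the running average should settle near that value.</p>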
<hr />
<p><strong>Intuition in two-dimensional case:</strong><br />
<img src="/blog/assets/article_images/2017-10-02-gibbs-sampling/gibbs.png" width="100%" align="center" alt="Gibbs sampling on the plane." /><br />
<small>Source: <a href="http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm">[3]</a></small></p>
<hr />
<p><strong>Gibbs sampling from a two-dimensional normal distribution.</strong></p>
<p>We will sample from the distribution of <script type="math/tex">\theta \sim \mathcal{N}(0, [\sigma_{ij}])</script>, where
<script type="math/tex">\sigma_{11} = \sigma_{22} = 1</script> and <script type="math/tex">\sigma_{12} = \sigma_{21} = \rho</script>.</p>
<p>Knowing the joint density of <script type="math/tex">\theta</script>, it is easy to show that:<br />
<script type="math/tex">\theta_{1}|\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})</script><br />
<script type="math/tex">\theta_{2}|\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})</script></p>
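<p>A short justification, using the standard conditioning formula for a bivariate normal vector with mean <script type="math/tex">(m_{1}, m_{2})</script> and covariance matrix <script type="math/tex">[\sigma_{ij}]</script> (the symbols <script type="math/tex">m_{1}, m_{2}</script> are introduced here only for this derivation):</p>
<script type="math/tex; mode=display">\theta_{1}\mid\theta_{2} \sim \mathcal{N}\left(m_{1} + \frac{\sigma_{12}}{\sigma_{22}}(\theta_{2} - m_{2}),\ \sigma_{11} - \frac{\sigma_{12}^{2}}{\sigma_{22}}\right)</script>
<p>Substituting <script type="math/tex">m_{1} = m_{2} = 0</script>, <script type="math/tex">\sigma_{11} = \sigma_{22} = 1</script> and <script type="math/tex">\sigma_{12} = \rho</script> gives mean <script type="math/tex">\rho\theta_{2}</script> and variance <script type="math/tex">1 - \rho^{2}</script>; the second conditional follows by symmetry. Note that <script type="math/tex">1 - \rho^{2}</script> is a variance, while the third argument of R’s <code>rnorm</code> is a standard deviation, so an implementation has to pass <script type="math/tex">\sqrt{1 - \rho^{2}}</script>.</p>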
<p><br />
R implementation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gibbs_normal_sampling</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">,</span><span class="w"> </span><span class="n">init_point</span><span class="p">,</span><span class="w"> </span><span class="n">ro</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># init_point is a numeric vector of length 2
</span><span class="w"> </span><span class="n">theta_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">init_point</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">numeric</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">))</span><span class="w">
</span><span class="n">theta_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">init_point</span><span class="p">[</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">numeric</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="n">n_iteration</span><span class="m">+1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">theta_1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ro</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">theta_2</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">ro</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">theta_2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ro</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">theta_1</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">ro</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">theta_1</span><span class="p">,</span><span class="w"> </span><span class="n">theta_2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p><br />
Using the above function, let’s see how the first 10 points were sampled for <script type="math/tex">\rho = 0.5</script>:<br />
<img src="/blog/assets/article_images/2017-10-02-gibbs-sampling/animation.gif" alt="Sampling ten points via gibbs sampler - animation." /></p>
<p>And for 10000 iterations:<br />
<img src="/blog/assets/article_images/2017-10-02-gibbs-sampling/sampler.png" style="align:center" width="100%" alt="Sampling from bivariate normal distribution." /></p>
<p>We leave it to you to compare how the stability of the estimates changes with the chosen <script type="math/tex">\rho</script> parameter.</p>
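<p>One possible starting point for that comparison is sketched below. It re-implements the same two-step update in a compact, self-contained form (the name <code>gibbs_bvn</code> and the chosen grid of <code>rho</code> values are assumptions of this example) and measures how strongly consecutive draws of <script type="math/tex">\theta_{1}</script> depend on each other. Substituting one update into the other shows that <script type="math/tex">\theta_{1}^{(t)} = \rho^{2}\theta_{1}^{(t-1)} + \text{noise}</script>, so the lag-1 autocorrelation should be close to <script type="math/tex">\rho^{2}</script>.</p>

```r
# Same two-step Gibbs update for the bivariate normal, written compactly.
gibbs_bvn <- function(n_iteration, init_point, rho) {
  theta <- matrix(NA_real_, nrow = n_iteration + 1, ncol = 2)
  theta[1, ] <- init_point
  s <- sqrt(1 - rho^2)  # conditional standard deviation
  for (i in 2:(n_iteration + 1)) {
    theta[i, 1] <- rnorm(1, rho * theta[i - 1, 2], s)
    theta[i, 2] <- rnorm(1, rho * theta[i, 1], s)
  }
  theta
}

set.seed(2017)
rhos <- c(0.1, 0.5, 0.9, 0.99)
lag1 <- sapply(rhos, function(rho) {
  theta <- gibbs_bvn(20000, c(0, 0), rho)
  # $acf[1] is lag 0 (always 1), $acf[2] is lag 1
  acf(theta[, 1], lag.max = 1, plot = FALSE)$acf[2]
})
data.frame(rho = rhos, lag1_autocorrelation = round(lag1, 2), theory = rhos^2)
```

<p>The closer <script type="math/tex">\rho</script> is to 1, the more slowly the chain moves across the distribution, and the more iterations are needed before the running averages settle.</p>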
<hr />
<p><strong>Let’s move on to using the Gibbs sampler to estimate density parameters.</strong></p>
<p>We will show how to use the Gibbs sampler and Bayesian statistics to estimate the mean parameters of a mixture of normal distributions.</p>
<p>Assumptions (simplified case):</p>
<ul>
<li>An i.i.d. sample <script type="math/tex">Y=(Y_{1},\ldots, Y_{N})</script> comes from a mixture of normal distributions <script type="math/tex">(1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>, where <script type="math/tex">\pi</script>, <script type="math/tex">\sigma_{1}</script> and <script type="math/tex">\sigma_{2}</script> are known.</li>
<li>For <script type="math/tex">i=1,2</script>, <script type="math/tex">\mu_{i} \sim \mathcal{N}(0, 1)</script> (prior distributions), and <script type="math/tex">\mu_{1}</script> and <script type="math/tex">\mu_{2}</script> are independent.</li>
<li><script type="math/tex">\Delta = (\Delta_{1}, \ldots, \Delta_{N})</script> - the (unobserved) classification vector that determines how Y is generated (when <script type="math/tex">\Delta_{i} = 0</script>, <script type="math/tex">Y_{i}</script> is drawn from <script type="math/tex">\mathcal{N}(\mu_{1}, \sigma_{1}^{2})</script>; when <script type="math/tex">\Delta_{i} = 1</script>, it is drawn from <script type="math/tex">\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>).</li>
<li>
<script type="math/tex; mode=display">\mathbb{P}(\Delta_{i} = 1) = \pi</script>
</li>
</ul>
<p>Under the above assumptions it can be shown that:</p>
<ul>
<li>
<script type="math/tex; mode=display">f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}</script>
</li>
<li>
<script type="math/tex; mode=display">(\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})</script>
</li>
<li>
<script type="math/tex; mode=display">(\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})</script>
</li>
<li>
<script type="math/tex; mode=display">\mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}</script>
</li>
</ul>
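<p>For reference, here is a sketch of where the conditional distributions of the means come from. Write <script type="math/tex">n_{1} = \sum_{i=1}^{N}(1-\Delta_{i})</script> (a symbol introduced here only for brevity). Combining the prior <script type="math/tex">\mu_{1} \sim \mathcal{N}(0, 1)</script> with <script type="math/tex">n_{1}</script> observations from <script type="math/tex">\mathcal{N}(\mu_{1}, \sigma_{1}^{2})</script>, the standard conjugate normal update adds the precisions and weights the data by <script type="math/tex">\sigma_{1}^{-2}</script>:</p>
<script type="math/tex; mode=display">(\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}\left(\frac{\sigma_{1}^{-2}\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + n_{1}\sigma_{1}^{-2}},\ \frac{1}{1 + n_{1}\sigma_{1}^{-2}}\right)</script>
<p>For <script type="math/tex">\sigma_{1} = 1</script> this reduces to the expression for <script type="math/tex">\mu_{1}</script> given above. The same argument, with <script type="math/tex">\Delta_{i}</script> in place of <script type="math/tex">1-\Delta_{i}</script> and <script type="math/tex">\sigma_{2}</script> in place of <script type="math/tex">\sigma_{1}</script>, applies to <script type="math/tex">\mu_{2}</script>.</p>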
<p>The form of the algorithm:</p>
<ol>
<li>Choose a starting point for the means <script type="math/tex">(\mu_{1}^{0}, \mu_{2}^{0})</script></li>
<li>In the <script type="math/tex">k</script>-th iteration do:
<blockquote>
<ul>
<li>For <script type="math/tex">i = 1, \ldots, N</script>, draw <script type="math/tex">\Delta_{i}^{(k)} \in \{0, 1\}</script>, setting it to 1 with probability:<br /><script type="math/tex">\frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}</script></li>
<li>Calculate:<br /><script type="math/tex">\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}</script><br /><script type="math/tex">\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}</script></li>
<li>Generate:<br /><script type="math/tex">\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})</script><br /><script type="math/tex">\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})</script></li>
</ul>
</blockquote>
</li>
<li>Repeat step 2 until the distribution of the vector <script type="math/tex">(\Delta, \mu_{1}, \mu_{2})</script> stabilizes.</li>
</ol>
<hr />
<p>How do we do this in R?</p>
<p>To start, let’s generate a random sample from a mixture of normals with parameters <script type="math/tex">(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2)</script>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">12345</span><span class="p">)</span><span class="w">
</span><span class="n">mu_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">mu_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">sigma_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">sigma_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">pi_known</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.7</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2000</span><span class="w">
</span><span class="n">pi_sampled</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">pi_known</span><span class="p">)</span><span class="w">
</span><span class="n">y_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu_1</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_1</span><span class="p">)</span><span class="w">
</span><span class="n">y_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_2</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pi_sampled</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y_1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">pi_sampled</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y_2</span></code></pre></figure>
<p>Take a quick look at a histogram of our data:</p>
<p><img src="/blog/figs/2017-10-02-gibbs-sampling/unnamed-chunk-4-1.png" alt="" /></p>
<p>The next task is to estimate <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> in the above model.</p>
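<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># purrr provides map_int() and re-exports the %>% pipe used in the loop below
</span><span class="n">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span><span class="w">
</span><span class="n">mu_init</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">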
</span><span class="n">n_iterations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">300</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">n_iterations</span><span class="p">)</span><span class="w">
</span><span class="n">mu_1_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">mu_init</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">numeric</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">))</span><span class="w">
</span><span class="n">mu_2_vec</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">mu_init</span><span class="p">[</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">numeric</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">))</span><span class="w">
</span><span class="n">delta_probability</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">mu_1_vec</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2_vec</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_1</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_2</span><span class="p">,</span><span class="w"> </span><span class="n">pi_known</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">pi_known</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2_vec</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">sigma_2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="p">((</span><span class="m">1</span><span class="o">-</span><span class="w"> </span><span class="n">pi_known</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">mu_1_vec</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">sigma_1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">pi_known</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2_vec</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">sigma_2</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">mu_1_mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">sum</span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">mu_2_mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="n">n_iterations</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="p">[[</span><span class="n">j</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">map_int</span><span class="p">(</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_probability</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">mu_1_vec</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2_vec</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_1</span><span class="p">,</span><span class="w"> </span><span class="n">sigma_2</span><span class="p">,</span><span class="w"> </span><span class="n">pi_known</span><span class="p">)))</span><span class="w">
</span><span class="n">mu_1_vec</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mu_1_mean</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">delta</span><span class="p">[[</span><span class="n">j</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]))))</span><span class="w">
</span><span class="n">mu_2_vec</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mu_2_mean</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">j</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]]))))</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
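<p>As a cross-check, the same sampler can also be written without <code>purrr</code>, because <code>dnorm</code> and <code>rbinom</code> are vectorized over observations. The sketch below is a self-contained variant, not the code used to produce the plots in this post. It regenerates data with the same parameter values, uses the conjugate update with precision <script type="math/tex">1 + n_{j}/\sigma_{j}^{2}</script> (which coincides with the expressions above when <script type="math/tex">\sigma_{j} = 1</script>), and summarizes the draws after an assumed burn-in of 50 iterations. The names <code>sigma</code>, <code>mu</code>, <code>prec_1</code>, <code>prec_2</code> and <code>burn_in</code> are choices made for this example.</p>

```r
set.seed(12345)
# Simulated data with the same parameter values as in the post
n <- 2000
pi_known <- 0.7
sigma <- c(1, 2)
delta_true <- rbinom(n, 1, pi_known)
y <- ifelse(delta_true == 1, rnorm(n, 2, sigma[2]), rnorm(n, 10, sigma[1]))

n_iterations <- 300
mu <- matrix(NA_real_, nrow = n_iterations + 1, ncol = 2)
mu[1, ] <- c(12, 0)  # starting point

for (k in 2:(n_iterations + 1)) {
  # P(Delta_i = 1 | mu, y), computed for all observations at once
  w1 <- (1 - pi_known) * dnorm(y, mu[k - 1, 1], sigma[1])
  w2 <- pi_known * dnorm(y, mu[k - 1, 2], sigma[2])
  delta <- rbinom(n, 1, w2 / (w1 + w2))

  # Conjugate update with prior N(0, 1): precision = 1 + n_j / sigma_j^2
  prec_1 <- 1 + sum(1 - delta) / sigma[1]^2
  prec_2 <- 1 + sum(delta) / sigma[2]^2
  mu[k, 1] <- rnorm(1, sum((1 - delta) * y) / sigma[1]^2 / prec_1, sqrt(1 / prec_1))
  mu[k, 2] <- rnorm(1, sum(delta * y) / sigma[2]^2 / prec_2, sqrt(1 / prec_2))
}

burn_in <- 50  # assumed burn-in length; drop the start point and early draws
colMeans(mu[-seq_len(burn_in + 1), ])
```

<p>The two averages should land close to the true means 10 and 2 used to simulate the data.</p>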
<p><br />
Let’s see how the sampled values relate to the known ones.</p>
<p>The following plot presents the mean of the <script type="math/tex">\Delta</script> vector at each iteration:</p>
<p><br /></p>
<p><img src="/blog/figs/2017-10-02-gibbs-sampling/unnamed-chunk-6-1.png" alt="" /></p>
<p><br /></p>
<p>Let’s check how the means of the parameters <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> stabilize under the algorithm:</p>
<p><img src="/blog/figs/2017-10-02-gibbs-sampling/unnamed-chunk-7-1.png" alt="" /></p>
<p><br /><br />
Note how few iterations were enough to stabilize the parameters.</p>
<p><br /></p>
<p>Finally, let’s compare the estimated density with our initial sample:</p>
<p><img src="/blog/figs/2017-10-02-gibbs-sampling/unnamed-chunk-8-1.png" alt="" /></p>
<hr />
<p>Readers interested in the topic can refer to <a href="http://gauss.stat.su.se/rr/RR2006_1.pdf">[1]</a>, which generalizes the normal mixture model by placing prior distributions on the other parameters as well.</p>
<p>It is also worth comparing the above algorithm with its deterministic counterpart, the Expectation Maximization (EM) algorithm; see <a href="http://web.stanford.edu/~hastie/ElemStatLearn/">[2]</a>.</p>
<hr />
<p>Write your questions and comments below. We’d love to hear what you think.</p>
<hr />
<p>Resources:</p>
<p>[1] http://gauss.stat.su.se/rr/RR2006_1.pdf</p>
<p>[2] http://web.stanford.edu/~hastie/ElemStatLearn/</p>
<p>[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm</p>
Mon, 09 Oct 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/10/09/gibbs-sampling.html
/blog/rstats/2017/10/09/gibbs-sampling.html
gibbs, sampler, mixture, mcmc, rstats
Super excited for R promises
<p>We at <a href="http://appsilondatascience.com/">Appsilon</a> are excited that RStudio will soon introduce promises in R, which is going to be a huge step forward for programming in R. We have already used futures and similar libraries to run code asynchronously; however, promises are going to be a standard, and they look very easy to use. They support chaining, which is a great way of building clean code by piping computations that take a long time.</p>
<p><img src="/blog/assets/article_images/2017-09-22-shiny-promises/excited.gif" alt="Our team cannot wait to use Promises :)" /></p>
<p>We’ve used futures/promises/tasks in programming languages like C#, Scala, and JavaScript, and promises have always had a big impact on the way code is structured and on the overall execution speed.</p>
<p>We recently spoke with Joe Cheng, and he mentioned promises at the EARL conference. Here are some links if you’re interested in reading more about promises on <a href="https://github.com/rstudio/promises">github</a> and <a href="https://medium.com/@joe.cheng/async-programming-in-r-and-shiny-ebe8c5010790">medium</a>.</p>
<p>What do you think? Let us know in the comments.</p>
Mon, 25 Sep 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/09/25/shiny-promises.html
/blog/rstats/2017/09/25/shiny-promises.htmlrrstudiopromicesscalajavascriptasynchronousrstatsOur experience at EARL London 2017<p>Earlier this week, I had the immense pleasure to attend EARL in London. I flew in from Warsaw on Tuesday and headed straight to my room. The event started with an evening reception, which was great for networking, obviously. I didn’t believe that the weather in London could change so quickly, but as it turns out, I was wrong. When I stepped outside after the reception, I was faced with squalls of rain and strong gusts of wind.</p>
<p>On the day of my talk, I woke up and was surprised to see a sunny blue sky. I had done my research and knew exactly which presentations I wanted to see. Cathy Atkinson from the Department for Business, Energy and Industrial Strategy gave a very detailed talk on ‘What to do when your data is words.’ One of the most interesting things she discussed was processing data to remove stop words and punctuation. Some phrases, like ‘state of the art’, lose their meaning when stop words are removed, and she described methods to counteract these kinds of issues. After processing the data, she was able to convert it into numbers and treat it like any other data set. She was very methodical in her presentation, which allowed the audience to fully understand what they would have to do to recreate the process.</p>
<p>James Lawrence, from the Behavioral Insights Team, gave my favorite talk of the second session on ‘Reducing traffic deaths and serious injuries.’ This talk was not as technical as Cathy’s, but had great storytelling and was very insightful. It turns out that East Sussex County has a surprising number of traffic deaths. What was most shocking is that a motorcyclist in a crash at highway speeds has more than a 50% chance of serious injury or death. What’s more, residents of East Sussex were not responsible for most of these accidents. Most serious accidents happened outside of cities, on the roads connecting them. This allowed local police officers to make better-informed decisions about where to place their caution signs. As this is an R conference, James showed how they used R for the analysis and to present their results.</p>
<p>I spoke during the second session on ‘Scaling Shiny to 700 users.’ You can take a look at the slides below. Feel free to comment if you have any questions.</p>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/1ZMaDLH57USIka" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/secret/1ZMaDLH57USIka" title="Scaling Shiny apps to 700 users at EARL London_2017" target="_blank">Scaling Shiny apps to 700 users at EARL London_2017</a> </strong> from <strong><a href="https://www.slideshare.net/appsilon" target="_blank">Appsilon Data Science</a></strong> </div>
<p>The evening reception topped off the day with a cruise along the Thames from Parliament all the way to the Thames Barrier. The views were amazing, providing a unique perspective of London.</p>
<p>I hope to see you at the next EARL.</p>
Fri, 15 Sep 2017 00:00:00 +0000
http://appsilondatascience.com/blog/business/2017/09/15/earl-london.html
/blog/business/2017/09/15/earl-london.htmlshinyrscaleshiny.semanticearllondonbusinessHow to internationalize your Shiny apps with shiny.i18n?<p>We pride ourselves on the fact that we work on some of the most difficult business applications of Shiny. These range from sales, manufacturing, and management tools to satellite data analysis tools.</p>
<p>Our clients include many multinational enterprises that are continuously looking for new markets to enter. Most markets, however, share a common issue that needs addressing: the language barrier. In particular, we have seen an increasing need for dashboards that are tuned to the needs of diverse employees worldwide.</p>
<p>This means that each country or region needs a dashboard adapted to its language and conventions. In short, there is a need for internationalization, hence shiny.i18n.</p>
<p>We’ve built an internationalization package for Shiny and have open sourced it for all to use and, hopefully, contribute to. This is not our first open source package. We encourage you to take a look at shiny.semantic and shiny.collections.</p>
<p>The shiny.i18n package supports CSV- and JSON-based translations. It even formats data according to local standards. We are still working on localized numbers and are in the process of adding our package to CRAN. Feel free to check our <a href="https://github.com/Appsilon/shiny.i18n/blob/master/inst/examples/basic/app_csv.R"><em>demo</em></a>, which currently supports English, Polish and Italian.</p>
<p>Our code is on <a href="https://github.com/Appsilon/shiny.i18n"><em>GitHub</em></a> and is under an MIT license. Take a look and let us know what you think.</p>
Tue, 12 Sep 2017 00:00:00 +0000
http://appsilondatascience.com/blog/rstats/2017/09/12/i18n.html
/blog/rstats/2017/09/12/i18n.htmlshinyri18ninternationalizationrstats