import React from 'react'; 
import {Link} from 'react-router-dom'; 
import {useRCustomEffect} from '../../useCustomEffect'; 
import imgVisualizeSkewedData from '../graphics_blog/summary_noLabel.png'; 
import imgVisualizeSkewedDataWebp from '../graphics_blog/summary_noLabel.webp'; 
export default function VisualizeSkewedData(){
useRCustomEffect()
return ( <div>
<div className="page-columns page-rows-contents page-layout-article" id="quarto-content">
<main className="content" id="quarto-document-content">
<header className="quarto-title-block default" id="title-block-header">
<div className="quarto-title">
<h1 className="title">Golden Methods to Visualize Data with Skewed Distribution</h1>
<p className="subtitle lead">Key Techniques to Unveil Hidden Patterns in Clustered Data</p>
</div>
<div className="quarto-title-meta">
</div>
</header>
  <picture>
    <source type="image/webp" srcset={imgVisualizeSkewedDataWebp} />
    <img className="cover-img" src={imgVisualizeSkewedData} />
  </picture>

<p><br/></p>
<p><strong>Skewed data refers to data with a highly uneven distribution</strong>: when the data of a variable is displayed as a histogram, the bulk of data points are either clustered on the left side of the distribution, with a long tail stretching towards the right (right-skewed), or the other way around (left-skewed), or arranged in a more complex skewed pattern. <strong><em>The long tail (sparse data) dominates the visual space and squeezes the majority of the data points into a corner of the graphic, making it difficult to discern the underlying pattern of the clustered majority.</em></strong></p>
<p>In this article, I’ve summarized seven powerful strategies to visualize data with skewed distribution.</p>
<blockquote className="blockquote">
<p>If you are an R user, you can click the <code>show the code</code> button to check the ggplot2 source code used to generate these graphs; otherwise a link to the source code will be provided. <em>If you are not an R user</em>, you can still enjoy this article!</p>
</blockquote>
<section className="level3" id="use-transparency-open-circles-colors-and-marginal-distribution-to-unveil-clustered-data-pattern">
<h3 className="anchored" data-anchor-id="use-transparency-open-circles-colors-and-marginal-distribution-to-unveil-clustered-data-pattern">Use Transparency, Open Circles, Colors, and Marginal Distribution to Unveil Clustered Data Pattern</h3>
<p>The following scatterplot displays the relationship between the housing sales and prices. The sparse data points at higher sales squeeze the majority of data to the left edge of the plot. This leads to significant data point clustering and overlap, and obscures the underlying distribution pattern.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb1"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb1-1"><a aria-hidden="true" href="#cb1-1" tabindex="-1"></a><span className="fu">library</span>(tidyverse)</span>
<span id="cb1-2"><a aria-hidden="true" href="#cb1-2" tabindex="-1"></a><span className="fu">library</span>(scico) <span className="co"># color palette</span></span>
<span id="cb1-3"><a aria-hidden="true" href="#cb1-3" tabindex="-1"></a></span><br/>
<span id="cb1-4"><a aria-hidden="true" href="#cb1-4" tabindex="-1"></a><span className="co"># set the default theme</span></span>
<span id="cb1-5"><a aria-hidden="true" href="#cb1-5" tabindex="-1"></a><span className="fu">theme_set</span>(</span>
<span id="cb1-6"><a aria-hidden="true" href="#cb1-6" tabindex="-1"></a>  hrbrthemes<span className="sc">::</span><span className="fu">theme_ipsum</span>()<span className="sc">+</span></span>
<span id="cb1-7"><a aria-hidden="true" href="#cb1-7" tabindex="-1"></a>    <span className="fu">theme</span>(</span>
<span id="cb1-8"><a aria-hidden="true" href="#cb1-8" tabindex="-1"></a>      <span className="at">plot.title =</span> <span className="fu">element_text</span>(<span className="at">hjust =</span> .<span className="dv">5</span>, <span className="at">face =</span> <span className="st">"bold"</span>, <span className="at">size =</span> <span className="dv">12</span>)))</span>
<span id="cb1-9"><a aria-hidden="true" href="#cb1-9" tabindex="-1"></a></span><br/>
<span id="cb1-10"><a aria-hidden="true" href="#cb1-10" tabindex="-1"></a>base <span className="ot">&lt;-</span> txhousing <span className="sc">%&gt;%</span> </span>
<span id="cb1-11"><a aria-hidden="true" href="#cb1-11" tabindex="-1"></a>  <span className="fu">ggplot</span>(<span className="fu">aes</span>(<span className="at">x =</span> sales, <span className="at">y =</span> median))</span>
<span id="cb1-12"><a aria-hidden="true" href="#cb1-12" tabindex="-1"></a></span><br/>
<span id="cb1-13"><a aria-hidden="true" href="#cb1-13" tabindex="-1"></a>base <span className="sc">+</span> </span>
<span id="cb1-14"><a aria-hidden="true" href="#cb1-14" tabindex="-1"></a>  <span className="fu">geom_point</span>(<span className="at">size =</span> <span className="dv">1</span>) <span className="sc">+</span> </span>
<span id="cb1-15"><a aria-hidden="true" href="#cb1-15" tabindex="-1"></a>  <span className="co"># discrete scale of the 'color' aesthetic ('scico' package)</span></span>
<span id="cb1-16"><a aria-hidden="true" href="#cb1-16" tabindex="-1"></a>  <span className="fu">scale_color_scico_d</span>(<span className="at">palette =</span> <span className="st">"berlin"</span>) </span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_scatterplot.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_scatterplot.png" data-fallback="graphics_blog/housing_skew_scatterplot.png" />
  </picture>

</div>
</div>
</div>
<p>The following improved scatterplot employs some simple techniques to diagnose data skewness and unveil underlying data structure.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb2"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb2-1"><a aria-hidden="true" href="#cb2-1" tabindex="-1"></a>p <span className="ot">&lt;-</span> base <span className="sc">+</span> </span>
<span id="cb2-2"><a aria-hidden="true" href="#cb2-2" tabindex="-1"></a>  <span className="co"># map 'city' variable to 'color', with 50% transparency  </span></span>
<span id="cb2-3"><a aria-hidden="true" href="#cb2-3" tabindex="-1"></a>  <span className="fu">geom_point</span>(<span className="fu">aes</span>(<span className="at">color =</span> city), <span className="at">size =</span> .<span className="dv">5</span>, <span className="at">alpha =</span> .<span className="dv">5</span>, <span className="at">shape =</span> <span className="dv">21</span>) <span className="sc">+</span></span>
<span id="cb2-4"><a aria-hidden="true" href="#cb2-4" tabindex="-1"></a>  <span className="co"># visualize marginal distribution as a barcode</span></span>
<span id="cb2-5"><a aria-hidden="true" href="#cb2-5" tabindex="-1"></a>  <span className="fu">geom_rug</span>(<span className="fu">aes</span>(<span className="at">color =</span> city), <span className="at">alpha =</span> .<span className="dv">04</span>, <span className="at">sides =</span> <span className="st">"tr"</span>) <span className="sc">+</span></span>
<span id="cb2-6"><a aria-hidden="true" href="#cb2-6" tabindex="-1"></a>  <span className="co"># turn off legends of both points and rugs</span></span>
<span id="cb2-7"><a aria-hidden="true" href="#cb2-7" tabindex="-1"></a>  <span className="fu">theme</span>(<span className="at">legend.position =</span> <span className="st">"none"</span>) <span className="sc">+</span></span>
<span id="cb2-8"><a aria-hidden="true" href="#cb2-8" tabindex="-1"></a>  <span className="co"># discrete scale of the 'color' aesthetic ('scico' package)</span></span>
<span id="cb2-9"><a aria-hidden="true" href="#cb2-9" tabindex="-1"></a>  <span className="fu">scale_color_scico_d</span>(<span className="at">palette =</span> <span className="st">"berlin"</span>)</span>
<span id="cb2-10"><a aria-hidden="true" href="#cb2-10" tabindex="-1"></a></span><br/>
<span id="cb2-11"><a aria-hidden="true" href="#cb2-11" tabindex="-1"></a><span className="co"># visualize marginal distributions with ggMarginal() from the ggExtra package</span></span>
<span id="cb2-12"><a aria-hidden="true" href="#cb2-12" tabindex="-1"></a>ggExtra<span className="sc">::</span><span className="fu">ggMarginal</span>(p, <span className="at">groupColour =</span> T)</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_marginal.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_marginal.png" data-fallback="graphics_blog/housing_skew_marginal.png" />
  </picture>

</div>
</div>
</div>
<ul>
<li><strong>Smaller point size</strong>, and <strong>shapes in outlines</strong> (such as open circles instead of solid ones) help mitigate the data overlap. </li>
<li><strong>Transparency</strong> of the data points helps unveil the amount of data clustering. </li>
<li><strong>Colors</strong> are an important tool to unveil data patterns. In this case, showing points of different cities in varied colors spotlights that the city has a profound impact on price and sales, and is a significant source of data variability.</li>
<li><strong>Marginal visualization</strong> of the univariate distributions shows the source of skewness. Here I displayed the distributions of sales (x axis) and prices (y axis) as barcode images and density plots. They show that sales is the more heavily skewed variable.</li>
</ul>
</section>
<section className="level3" id="create-subplot-separately-for-sources-of-skewness">
<h3 className="anchored" data-anchor-id="create-subplot-separately-for-sources-of-skewness">Create Subplot Separately for Sources of Skewness</h3>
<p>The above plot suggests that the city is an important contributing factor to the data skewness. To remove the effect of cities on the skewed data distribution, here we <strong>divide the plot into subplots, one for each city</strong>, with <strong>each subplot having its own axis scale</strong> (labels not shown for clarity).</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb3"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb3-1"><a aria-hidden="true" href="#cb3-1" tabindex="-1"></a>base <span className="sc">+</span> </span>
<span id="cb3-2"><a aria-hidden="true" href="#cb3-2" tabindex="-1"></a>  <span className="fu">geom_point</span>(<span className="fu">aes</span>(<span className="at">color =</span> city), </span>
<span id="cb3-3"><a aria-hidden="true" href="#cb3-3" tabindex="-1"></a>             <span className="at">alpha =</span> .<span className="dv">5</span>, <span className="at">size =</span> <span className="dv">1</span>, <span className="at">show.legend =</span> F) <span className="sc">+</span></span>
<span id="cb3-4"><a aria-hidden="true" href="#cb3-4" tabindex="-1"></a>  <span className="fu">facet_wrap</span>(<span className="sc">~</span>city, <span className="at">scales =</span> <span className="st">"free"</span>, <span className="at">ncol =</span> <span className="dv">8</span>) <span className="sc">+</span></span>
<span id="cb3-5"><a aria-hidden="true" href="#cb3-5" tabindex="-1"></a>  <span className="fu">scale_color_scico_d</span>(<span className="at">palette =</span> <span className="st">"berlin"</span>) <span className="sc">+</span></span>
<span id="cb3-6"><a aria-hidden="true" href="#cb3-6" tabindex="-1"></a>  <span className="fu">theme_void</span>() <span className="sc">+</span></span>
<span id="cb3-7"><a aria-hidden="true" href="#cb3-7" tabindex="-1"></a>  <span className="fu">theme</span>(</span>
<span id="cb3-8"><a aria-hidden="true" href="#cb3-8" tabindex="-1"></a>    <span className="at">panel.border =</span> <span className="fu">element_rect</span>(<span className="at">color =</span> <span className="st">"black"</span>, <span className="at">fill =</span> <span className="cn">NA</span>),</span>
<span id="cb3-9"><a aria-hidden="true" href="#cb3-9" tabindex="-1"></a>    <span className="at">plot.margin =</span> <span className="fu">margin</span>(<span className="dv">10</span>, <span className="dv">10</span>, <span className="dv">10</span>, <span className="dv">10</span>),</span>
<span id="cb3-10"><a aria-hidden="true" href="#cb3-10" tabindex="-1"></a>    <span className="at">strip.text =</span> <span className="fu">element_text</span>(<span className="at">face =</span> <span className="st">"bold"</span>, <span className="at">color =</span> <span className="st">"grey30"</span>)</span>
<span id="cb3-11"><a aria-hidden="true" href="#cb3-11" tabindex="-1"></a>  )</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_facet.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_facet.png" data-fallback="graphics_blog/housing_skew_facet.png" />
  </picture>

</div>
</div>
</div>
<p>While we were lucky enough to identify the city variable as an important source of data skewness, and to reduce data skewness by creating subplots separately for each city, it’s not always easy or possible to do so in other data viz tasks. In addition, stubborn outliers can exist within a single variable, which is not solvable by creating subplots. </p>
<p><strong>In the following discussion, I’ll demonstrate powerful techniques to visualize a skewed dataset <em>directly</em>, without relying on subplots of the city variable.</strong></p>
</section>
<section className="level3" id="create-2d-histogram-to-visualize-skewed-the-data">
<h3 className="anchored" data-anchor-id="create-2d-histogram-to-visualize-skewed-the-data">Create 2D Histogram to Visualize the Skewed Data</h3>
<p>A 2D histogram is one of the most efficient approaches to address data skewness encountered in a scatterplot. It creates n intervals (bins) along each of the x and y axes, and divides the plot into a grid of n × n cells. <strong>The number of data points falling into each cell is counted, and then mapped to a color scale</strong>. A larger number of bins gives a higher resolution.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb4"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb4-1"><a aria-hidden="true" href="#cb4-1" tabindex="-1"></a>base <span className="sc">+</span> </span>
<span id="cb4-2"><a aria-hidden="true" href="#cb4-2" tabindex="-1"></a>  <span className="co"># Create a 2D histogram</span></span>
<span id="cb4-3"><a aria-hidden="true" href="#cb4-3" tabindex="-1"></a>  <span className="fu">geom_bin_2d</span>(<span className="at">bins =</span> <span className="dv">100</span>) <span className="sc">+</span> </span>
<span id="cb4-4"><a aria-hidden="true" href="#cb4-4" tabindex="-1"></a>  <span className="co"># continuous color scale of 'fill' aesthetic</span></span>
<span id="cb4-5"><a aria-hidden="true" href="#cb4-5" tabindex="-1"></a>  <span className="fu">scale_fill_viridis_c</span>(<span className="at">option =</span> <span className="st">"B"</span>) <span className="sc">+</span> </span>
<span id="cb4-6"><a aria-hidden="true" href="#cb4-6" tabindex="-1"></a>  <span className="co"># adjust colorbar height</span></span>
<span id="cb4-7"><a aria-hidden="true" href="#cb4-7" tabindex="-1"></a>  <span className="fu">guides</span>(<span className="at">fill =</span> <span className="fu">guide_colorbar</span>(<span className="at">barheight =</span> <span className="fu">unit</span>(<span className="dv">200</span>, <span className="st">"pt"</span>))) <span className="sc">+</span></span>
<span id="cb4-8"><a aria-hidden="true" href="#cb4-8" tabindex="-1"></a>  <span className="fu">theme</span>(<span className="at">plot.margin =</span> <span className="fu">margin</span>(<span className="dv">0</span>, <span className="dv">0</span>, <span className="dv">0</span>, <span className="dv">0</span>))</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_2D_histogram.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_2D_histogram.png" data-fallback="graphics_blog/housing_skew_2D_histogram.png" />
  </picture>

</div>
</div>
</div>
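<p>The counting step behind a 2D histogram can be sketched in base R with <code>cut()</code> and <code>table()</code>. This is only an illustration under stated assumptions: the data are simulated right-skewed values (not the housing dataset), and the bin number of 10 is arbitrary.</p>

```r
# Simulated right-skewed data (hypothetical, not the txhousing dataset)
set.seed(1)
x = rexp(1000)
y = rexp(1000)

n = 10  # number of bins per axis (arbitrary choice for this sketch)

# cut() splits each variable into n equal-width intervals
x_bin = cut(x, breaks = n)
y_bin = cut(y, breaks = n)

# cross-tabulate the bin memberships: an n-by-n grid of counts
counts = table(x_bin, y_bin)

dim(counts)   # 10 10
sum(counts)   # 1000: every point lands in exactly one cell
counts[1, 1]  # the corner cell, where the clustered bulk of the data sits
```

<p>Each entry of <code>counts</code> corresponds to one cell of the grid; a plotting function then only needs to map these counts to a color scale.</p>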
</section>
<section className="level3" id="create-2d-density-plot">
<h3 className="anchored" data-anchor-id="create-2d-density-plot">Create 2D Density Plot</h3>
<p>A 2D density plot is similar to the 2D histogram, but instead of counting points in discrete intervals, it <strong>estimates the probability density function (PDF) of the underlying continuous data</strong>. (The estimated density can still be drawn in either format: as continuous contours or as a discrete grid.)</p>
<p>The density plot on the left side spotlights the extreme concentration of data points at the bottom left corner of the plot, while the remaining sparse data points, with their minimal density values, blend completely into the background (where the density is zero).</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb5"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb5-1"><a aria-hidden="true" href="#cb5-1" tabindex="-1"></a><span className="co"># Create 2D density plots with density scale transformed or not</span></span>
<span id="cb5-2"><a aria-hidden="true" href="#cb5-2" tabindex="-1"></a>f <span className="ot">&lt;-</span> <span className="cf">function</span>(</span>
<span id="cb5-3"><a aria-hidden="true" href="#cb5-3" tabindex="-1"></a>    dens_exp, <span className="co"># raise density to power of 'dens_exp' before mapping to the color scale  </span></span>
<span id="cb5-4"><a aria-hidden="true" href="#cb5-4" tabindex="-1"></a>    myTitle   <span className="co"># plot title, noting the density transformation  </span></span>
<span id="cb5-5"><a aria-hidden="true" href="#cb5-5" tabindex="-1"></a>)&#123; </span>
<span id="cb5-6"><a aria-hidden="true" href="#cb5-6" tabindex="-1"></a>  base <span className="sc">+</span> </span>
<span id="cb5-7"><a aria-hidden="true" href="#cb5-7" tabindex="-1"></a>    <span className="fu">stat_density_2d</span>(</span>
<span id="cb5-8"><a aria-hidden="true" href="#cb5-8" tabindex="-1"></a>      <span className="co"># transform the density scale:</span></span>
<span id="cb5-9"><a aria-hidden="true" href="#cb5-9" tabindex="-1"></a>      <span className="co"># dens_exp = 1   - map the density (default) directly to color scale</span></span>
<span id="cb5-10"><a aria-hidden="true" href="#cb5-10" tabindex="-1"></a>      <span className="co"># dens_exp = 1/3 - map cube root of density to color scale</span></span>
<span id="cb5-11"><a aria-hidden="true" href="#cb5-11" tabindex="-1"></a>      <span className="fu">aes</span>(<span className="at">fill =</span> <span className="fu">after_stat</span>(density) <span className="sc">^</span> dens_exp), </span>
<span id="cb5-12"><a aria-hidden="true" href="#cb5-12" tabindex="-1"></a>      </span>
<span id="cb5-13"><a aria-hidden="true" href="#cb5-13" tabindex="-1"></a>      <span className="at">geom =</span> <span className="st">"raster"</span>, <span className="co"># draw as grids </span></span>
<span id="cb5-14"><a aria-hidden="true" href="#cb5-14" tabindex="-1"></a>      <span className="at">n =</span> <span className="dv">100</span>,         <span className="co"># at resolution of 100 bins per axis</span></span>
<span id="cb5-15"><a aria-hidden="true" href="#cb5-15" tabindex="-1"></a>      <span className="at">contour =</span> F      <span className="co"># not draw contours</span></span>
<span id="cb5-16"><a aria-hidden="true" href="#cb5-16" tabindex="-1"></a>    ) <span className="sc">+</span></span>
<span id="cb5-17"><a aria-hidden="true" href="#cb5-17" tabindex="-1"></a>    <span className="fu">theme</span>(<span className="at">legend.position =</span> <span className="st">"bottom"</span>, </span>
<span id="cb5-18"><a aria-hidden="true" href="#cb5-18" tabindex="-1"></a>          <span className="at">plot.margin =</span> <span className="fu">margin</span>(<span className="dv">0</span>, <span className="dv">4</span>)) <span className="sc">+</span></span>
<span id="cb5-19"><a aria-hidden="true" href="#cb5-19" tabindex="-1"></a>    <span className="fu">labs</span>(<span className="at">title =</span> myTitle,  <span className="co"># note the density scale transformation</span></span>
<span id="cb5-20"><a aria-hidden="true" href="#cb5-20" tabindex="-1"></a>         <span className="at">fill =</span> <span className="cn">NULL</span>) <span className="sc">+</span>    <span className="co"># remove colorbar title</span></span>
<span id="cb5-21"><a aria-hidden="true" href="#cb5-21" tabindex="-1"></a>    <span className="co"># customize colorbar appearance</span></span>
<span id="cb5-22"><a aria-hidden="true" href="#cb5-22" tabindex="-1"></a>    <span className="fu">guides</span>(<span className="at">fill =</span> <span className="fu">guide_colorbar</span>(</span>
<span id="cb5-23"><a aria-hidden="true" href="#cb5-23" tabindex="-1"></a>      <span className="at">barwidth =</span> <span className="fu">unit</span>(<span className="dv">150</span>, <span className="st">"pt"</span>),</span>
<span id="cb5-24"><a aria-hidden="true" href="#cb5-24" tabindex="-1"></a>      <span className="at">title.theme =</span> <span className="fu">element_text</span>(<span className="at">hjust =</span> .<span className="dv">5</span>))) <span className="sc">+</span></span>
<span id="cb5-25"><a aria-hidden="true" href="#cb5-25" tabindex="-1"></a>    <span className="co"># continuous color scale of the 'fill' aesthetic</span></span>
<span id="cb5-26"><a aria-hidden="true" href="#cb5-26" tabindex="-1"></a>    <span className="fu">scale_fill_viridis_c</span>(<span className="at">option =</span> <span className="st">"B"</span>)</span>
<span id="cb5-27"><a aria-hidden="true" href="#cb5-27" tabindex="-1"></a>&#125;</span>
<span id="cb5-28"><a aria-hidden="true" href="#cb5-28" tabindex="-1"></a></span><br/>
<span id="cb5-29"><a aria-hidden="true" href="#cb5-29" tabindex="-1"></a>p1 <span className="ot">&lt;-</span> <span className="fu">f</span>(<span className="at">dens_exp =</span> <span className="dv">1</span>,   <span className="at">myTitle =</span> <span className="st">"density (Linear)"</span>)</span>
<span id="cb5-30"><a aria-hidden="true" href="#cb5-30" tabindex="-1"></a>p2 <span className="ot">&lt;-</span> <span className="fu">f</span>(<span className="at">dens_exp =</span> <span className="dv">1</span><span className="sc">/</span><span className="dv">3</span>, <span className="at">myTitle =</span> <span className="st">"density (cube root)"</span>)</span>
<span id="cb5-31"><a aria-hidden="true" href="#cb5-31" tabindex="-1"></a></span><br/>
<span id="cb5-32"><a aria-hidden="true" href="#cb5-32" tabindex="-1"></a><span className="fu">library</span>(patchwork)</span>
<span id="cb5-33"><a aria-hidden="true" href="#cb5-33" tabindex="-1"></a>p1 <span className="sc">|</span> p2 </span></code></pre></div>
</details>
</div>
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_2D_density.webp" />
    <img className="wideFig img-fluid" src="graphics_blog/housing_skew_2D_density.png" data-fallback="graphics_blog/housing_skew_2D_density.png" />
  </picture>

<p>To make the sparse data more visible, here I first converted the density values to their cube roots, which were then mapped to the color scale. <strong>Root transformation increases the value of fractional numbers</strong> (the density values), <strong>but does not affect zeros</strong> (the background where there is no data). This way, the transformed data points “rise” higher above the background and are easier to distinguish, which nicely brings out both the concentrated data center and the sparse data regions. (This inflation of fractional numbers, however, can lead to undesired visual effects in some cases, as demonstrated in a later example.)</p>
</section>
<section className="level3" id="scatterplot-with-neighbor-number-shown-in-colors">
<h3 className="anchored" data-anchor-id="scatterplot-with-neighbor-number-shown-in-colors">Scatterplot with Neighbor Number Shown in Colors</h3>
<p>Another approach is to create a scatterplot, <strong>with each point’s local density (number of neighbors) mapped to colors</strong>. The visual effect is similar to the 2D histogram shown above, but <em>with each individual point displayed</em>, instead of counts in grid cells. It therefore unveils both the individual data points and the overall data pattern. (In R, such a visualization can be easily achieved using the <code>geom_pointdensity()</code> function from the <Link to="https://github.com/LKremer/ggpointdensity">ggpointdensity</Link> package.)</p>
<p>This approach, however, does not resolve the data overlap, and points buried in lower layers have low visibility, rendering the color scale less effective. To aid in unveiling the data pattern, an effective solution is to mathematically transform the density values before mapping them to the color scale (similar to the 2D density plot above). Here I took the logarithm of each point’s neighbor count, and then visualized the transformed values using the color scale (the colorbar is labeled with the <em>original</em> numbers, not the transformed results, for improved readability).</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb6"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb6-1"><a aria-hidden="true" href="#cb6-1" tabindex="-1"></a><span className="fu">library</span>(ggpointdensity)</span>
<span id="cb6-2"><a aria-hidden="true" href="#cb6-2" tabindex="-1"></a></span><br/>
<span id="cb6-3"><a aria-hidden="true" href="#cb6-3" tabindex="-1"></a>p3 <span className="ot">&lt;-</span> base <span className="sc">+</span> </span>
<span id="cb6-4"><a aria-hidden="true" href="#cb6-4" tabindex="-1"></a>  <span className="co"># create scatterplot-density plot</span></span>
<span id="cb6-5"><a aria-hidden="true" href="#cb6-5" tabindex="-1"></a>  <span className="fu">geom_pointdensity</span>() <span className="sc">+</span> </span>
<span id="cb6-6"><a aria-hidden="true" href="#cb6-6" tabindex="-1"></a>  <span className="co"># map log2(neighbor number) to the continuous scale of 'color' aesthetic</span></span>
<span id="cb6-7"><a aria-hidden="true" href="#cb6-7" tabindex="-1"></a>  <span className="co"># other transformation such as 'log10' or square roots ('sqrt') also work</span></span>
<span id="cb6-8"><a aria-hidden="true" href="#cb6-8" tabindex="-1"></a>  <span className="fu">scale_color_viridis_c</span>(<span className="at">option =</span> <span className="st">"B"</span>, <span className="at">trans =</span> <span className="st">"log2"</span>, <span className="at">breaks =</span> <span className="dv">2</span><span className="sc">^</span>(<span className="dv">0</span><span className="sc">:</span><span className="dv">4</span>)) <span className="sc">+</span></span>
<span id="cb6-9"><a aria-hidden="true" href="#cb6-9" tabindex="-1"></a>  <span className="co"># increase colorbar length</span></span>
<span id="cb6-10"><a aria-hidden="true" href="#cb6-10" tabindex="-1"></a>  <span className="fu">guides</span>(<span className="at">color =</span> <span className="fu">guide_colorbar</span>(<span className="at">barheight =</span> <span className="fu">unit</span>(<span className="dv">200</span>, <span className="st">"pt"</span>))) <span className="sc">+</span></span>
<span id="cb6-11"><a aria-hidden="true" href="#cb6-11" tabindex="-1"></a>  <span className="fu">ggtitle</span>(<span className="st">"Scatterplot with density-based color scale"</span>)</span>
<span id="cb6-12"><a aria-hidden="true" href="#cb6-12" tabindex="-1"></a></span><br/>
<span id="cb6-13"><a aria-hidden="true" href="#cb6-13" tabindex="-1"></a>p3</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_pointdensity.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_pointdensity.png" data-fallback="graphics_blog/housing_skew_pointdensity.png" />
  </picture>

</div>
</div>
</div>
</section>
<section className="level3" id="mathematical-transform-of-data-mapped-to-color">
<h3 className="anchored" data-anchor-id="mathematical-transform-of-data-mapped-to-color">Mathematical Transform of Data Mapped to Color</h3>
<p>You have seen from examples above how mathematical transformation effectively aids in data visualization. <strong><em>Yet below is my favorite example.</em></strong></p>
<p>The heatmaps show the population density of Africa, in people per square km: the minimum density is 0 (e.g., in most parts of Egypt on the map), and non-zero values range from 0.0000004 to 20,000; large density values are found only in limited geographic regions. The density is mapped to the classic viridis color scale.</p>
<p><strong>Without any data transformation, the map is utterly blacked out</strong> (see plot on the left): the very scattered large values (depicted in brighter colors) are overwhelmed and drowned out by the bulk of smaller numbers.</p>
<div className="quarto-figure quarto-figure-center">
<div className="mg-t-20 mg-b-20">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/pseudo_log_transform.webp" />
    <img className="wideFig img-fluid figure-img" src="graphics_blog/pseudo_log_transform.png" data-fallback="graphics_blog/pseudo_log_transform.png" />
  </picture>
<figcaption>Log and pseudo-log transform unveil the pattern of highly skewed data. <Link to="/blog/pseudo-log-transformation"><strong>See more details and R source code here.</strong></Link></figcaption>
</div>
</div>
<p><strong>Logarithmic transformation</strong> is a common approach to dealing with widely spread data. It <strong>makes the distribution of the data more symmetric</strong>, and thus easier to visualize. This approach creates a heatmap with a better delineated data pattern, such as the dense population along the Nile river in Egypt. (The colorbar is labeled with the original data, not the transformed data, for easier interpretation.) <strong>Most parts of Egypt, however, are greyed out</strong>: these places have a population value of <strong>zero, whose logarithm is undefined</strong>, so the data is treated as “missing values” (-Inf, to be specific).</p>
<p><strong>Pseudo-logarithmic transformation performs the classic logarithmic operation for large numbers, but gradually transitions to a linear scale as the values approach zero.</strong> It handles large numbers, fractional numbers, and zeros (and negative numbers) in a smooth manner. This transformation generates an impressive heatmap with a well-defined data pattern, highlighting the most populous geographical sites and the dire inhospitality of the vast Sahara desert. You can find more details about this technique <Link to="/blog/pseudo-log-transformation"><strong><em>in this article</em></strong></Link>.</p>
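<p>The contrast between the two transformations at zero can be checked numerically. The sketch below is an illustration, not the code behind the maps: it assumes an asinh-based pseudo-log formula (the form used by <code>pseudo_log_trans()</code> in the scales package, as far as I am aware), and the <code>sigma</code> and <code>base</code> values are illustrative choices. The test values span the range of densities quoted above.</p>

```r
# Sketch of a pseudo-log transform (assumed asinh-based form);
# sigma and base are illustrative defaults, not values from the article
pseudo_log = function(x, sigma = 1, base = 10) asinh(x / (2 * sigma)) / log(base)

# zero, the smallest and largest non-zero densities, and values in between
x = c(0, 0.0000004, 1, 100, 20000)

log10(x)       # first element is -Inf: zero has no logarithm
pseudo_log(x)  # first element is 0; large values track log10(x) closely
```

<p>Because the pseudo-log stays finite at zero and grows smoothly from there, zero-density regions receive an ordinary color instead of being treated as missing.</p>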
<p>Compared to pseudo-log transformation, root transformation of a higher degree (e.g. the 7th to 10th root) yields a comparable visual effect. However, Egypt and some other places are blacked out. This is because <strong>root transformations inflate fractional numbers but leave zeros unaffected</strong>. This results in a <strong>sharp discontinuity between transformed fractional numbers and zeros, mirrored as an abrupt color transition</strong> between Egypt and the surrounding areas. (<em>You can find more discussion in the article linked above.</em>)</p>
<div className="quarto-figure quarto-figure-center">
<div className="mg-t-20 mg-b-20">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/pseudo_log_logarithmic_transform.webp" />
    <img className="wideFig img-fluid figure-img" src="graphics_blog/pseudo_log_logarithmic_transform.png" data-fallback="graphics_blog/pseudo_log_logarithmic_transform.png" />
  </picture>
<figcaption>Root transform creates a discontinuous color transition. <Link to="/blog/pseudo-log-transformation"><strong>See more details and R source code here</strong></Link>.</figcaption>
</div>
</div>
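<p>The jump between zeros and small fractions can be checked numerically. The sketch below uses made-up values and the 10th root as one example; the pseudo-log line again assumes the <code>asinh</code>-based form and is included only for comparison.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span><span className="co"># hypothetical values: an empty cell, a sparsely populated one, and two larger ones</span></span>
<span>x &lt;- c(0, 0.001, 1, 1000)</span>
<span></span><br/>
<span><span className="co"># 10th root: 0 stays 0, but 0.001 is lifted to about 0.5</span></span>
<span>round(x^(1/10), 3)</span>
<span></span><br/>
<span><span className="co"># pseudo-log (assumed asinh form): 0.001 stays right next to 0</span></span>
<span>round(asinh(x / 2) / log(10), 3)</span></code></pre></div>
</details>
</div>
<p>Under the 10th root, the value 0.001 already sits about a quarter of the way between 0 and the transformed value of 1000, which is why cells with zeros stand out so sharply against their almost-empty neighbors.</p>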
<p><em>Side note:</em> Instead of mathematically transforming the data, you can retain the original values but strategically adjust how they are mapped to colors, and how outliers are graphically dampened or emphasized. For instance, the following heatmaps display the incidence of infectious diseases in the U.S. before and after the introduction of the vaccine (marked by the black line). The plots employ a large range of alarming yellows and reds to highlight the high incidences (the sparse data region), and only a short range of blues and greens for the bulk of low-incidence data. This design effectively spotlights the power of vaccination in disease control.</p>
<div className="quarto-figure quarto-figure-center">
<div className="mg-t-20 mg-b-20">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/heatmap_vaccine_function_completed_all_8_diseases.webp" />
    <img className="wideFig img-fluid figure-img" src="graphics_blog/heatmap_vaccine_function_completed_all_8_diseases.png" data-fallback="graphics_blog/heatmap_vaccine_function_completed_all_8_diseases.png" />
  </picture>
<figcaption>See R source code <Link to="/R/gallery/ggplot2-heatmap-vaccine-for-8-diseases"><strong>here</strong></Link>.</figcaption>
</div>
</div>
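<p>One way to build such an uneven color mapping in ggplot2 is <code>scale_fill_gradientn()</code> with its <code>values</code> argument, which positions each color along the rescaled data range. The sketch below is illustrative only: the simulated data frame, the colors, and the breakpoints are invented and are not those of the linked gallery plot.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span>library(ggplot2)</span>
<span></span><br/>
<span><span className="co"># made-up, right-skewed incidence data for 10 hypothetical states</span></span>
<span>set.seed(1)</span>
<span>d &lt;- expand.grid(year = 1950:1979, state = LETTERS[1:10])</span>
<span>d$incidence &lt;- rexp(nrow(d), rate = 1 / 50)</span>
<span></span><br/>
<span>ggplot(d, aes(year, state, fill = incidence)) +</span>
<span>  geom_tile() +</span>
<span>  scale_fill_gradientn(</span>
<span>    colours = c("steelblue4", "seagreen3", "yellow", "orange", "red3"),</span>
<span>    <span className="co"># positions in [0, 1]: blues and greens cover only the lowest 10% of the range</span></span>
<span>    values = c(0, 0.05, 0.1, 0.5, 1))</span></code></pre></div>
</details>
</div>
<p>Moving the breakpoints in <code>values</code> closer to 0 gives even more of the color range to the sparse high values.</p>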
</section>
<section className="level3" id="mathematical-transform-of-data-mirrored-to-axes">
<h3 className="anchored" data-anchor-id="mathematical-transform-of-data-mirrored-to-axes">Mathematical Transform of Data Mirrored to Axes</h3>
<p>Now back to the housing sale-price example. In the plots below, the x-axis is transformed to a logarithmic scale of base 10. This way, the otherwise heavily clustered data points are spread apart to unveil the data structure.</p>
<p>For extra fun, I’ve created two plots from the same data, each with a different aesthetic flavor. The first one is a scatterplot, with each point’s number of neighbors linearly mapped to the color scale.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb7"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb7-1"><a aria-hidden="true" href="#cb7-1" tabindex="-1"></a>base2 <span className="ot">&lt;-</span> base <span className="sc">+</span> </span>
<span id="cb7-2"><a aria-hidden="true" href="#cb7-2" tabindex="-1"></a>  <span className="co"># transform x-axis to log10 scale</span></span>
<span id="cb7-3"><a aria-hidden="true" href="#cb7-3" tabindex="-1"></a>  <span className="fu">scale_x_log10</span>() <span className="sc">+</span></span>
<span id="cb7-4"><a aria-hidden="true" href="#cb7-4" tabindex="-1"></a>  <span className="co"># add log10-based ticks at the bottom side of the plot</span></span>
<span id="cb7-5"><a aria-hidden="true" href="#cb7-5" tabindex="-1"></a>  <span className="fu">annotation_logticks</span>(<span className="at">sides =</span> <span className="st">"b"</span>) <span className="sc">+</span></span>
<span id="cb7-6"><a aria-hidden="true" href="#cb7-6" tabindex="-1"></a>  <span className="fu">guides</span>(<span className="at">color =</span> <span className="fu">guide_colorbar</span>(<span className="at">barheight =</span> <span className="fu">unit</span>(<span className="dv">200</span>, <span className="st">"pt"</span>))) </span>
<span id="cb7-7"><a aria-hidden="true" href="#cb7-7" tabindex="-1"></a></span><br/>
<span id="cb7-8"><a aria-hidden="true" href="#cb7-8" tabindex="-1"></a><span className="co"># Create an open circle - density plot at a log10-transformed x-axis.</span></span>
<span id="cb7-9"><a aria-hidden="true" href="#cb7-9" tabindex="-1"></a>base2 <span className="sc">+</span> </span>
<span id="cb7-10"><a aria-hidden="true" href="#cb7-10" tabindex="-1"></a>  <span className="fu">geom_pointdensity</span>(<span className="at">shape =</span> <span className="dv">21</span>) <span className="sc">+</span> </span>
<span id="cb7-11"><a aria-hidden="true" href="#cb7-11" tabindex="-1"></a>  <span className="fu">scale_color_viridis_c</span>(<span className="at">option =</span> <span className="st">"B"</span>, <span className="at">breaks =</span> <span className="fu">seq</span>(<span className="dv">0</span>, <span className="dv">1000</span>, <span className="dv">200</span>))</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_log10_pointdensity.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_log10_pointdensity.png" data-fallback="graphics_blog/housing_skew_log10_pointdensity.png" />
  </picture>

</div>
</div>
</div>
<p>The second plot is a 2D histogram (discrete), with the estimated probability density drawn as contour lines (continuous).</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb8"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb8-1"><a aria-hidden="true" href="#cb8-1" tabindex="-1"></a><span className="co"># A couple of technical notes:</span></span>
<span id="cb8-2"><a aria-hidden="true" href="#cb8-2" tabindex="-1"></a><span className="co"># 1. Set `begin = .1` in the color scale of the `fill` aesthetic, so that colors associated with the lowest counts start from brighter colors, not black, to be distinguished from the background.</span></span>
<span id="cb8-3"><a aria-hidden="true" href="#cb8-3" tabindex="-1"></a><span className="co"># 2. Ticks color should be set within `annotation_logticks()`; setting in `axis.ticks` in `theme()` is not effective. </span></span>
<span id="cb8-4"><a aria-hidden="true" href="#cb8-4" tabindex="-1"></a></span><br/>
<span id="cb8-5"><a aria-hidden="true" href="#cb8-5" tabindex="-1"></a>base2 <span className="sc">+</span> </span>
<span id="cb8-6"><a aria-hidden="true" href="#cb8-6" tabindex="-1"></a>  <span className="fu">geom_bin_2d</span>(<span className="at">bins =</span> <span className="dv">100</span>) <span className="sc">+</span></span>
<span id="cb8-7"><a aria-hidden="true" href="#cb8-7" tabindex="-1"></a>  <span className="fu">scale_fill_viridis_c</span>(<span className="at">option =</span> <span className="st">"C"</span>, <span className="at">begin =</span> .<span className="dv">1</span>) <span className="sc">+</span></span>
<span id="cb8-8"><a aria-hidden="true" href="#cb8-8" tabindex="-1"></a>  <span className="fu">guides</span>(<span className="at">fill =</span> <span className="fu">guide_colorbar</span>(<span className="at">barheight =</span> <span className="fu">unit</span>(<span className="dv">200</span>, <span className="st">"pt"</span>))) <span className="sc">+</span></span>
<span id="cb8-9"><a aria-hidden="true" href="#cb8-9" tabindex="-1"></a>  <span className="fu">theme</span>(<span className="at">panel.background =</span> <span className="fu">element_rect</span>(<span className="at">fill =</span> <span className="st">"black"</span>)) <span className="sc">+</span></span>
<span id="cb8-10"><a aria-hidden="true" href="#cb8-10" tabindex="-1"></a>  <span className="fu">annotation_logticks</span>(<span className="at">sides =</span> <span className="st">"b"</span>, <span className="at">colour =</span> <span className="st">"snow3"</span>) <span className="sc">+</span></span>
<span id="cb8-11"><a aria-hidden="true" href="#cb8-11" tabindex="-1"></a>  <span className="co"># sketch out the contour circles</span></span>
<span id="cb8-12"><a aria-hidden="true" href="#cb8-12" tabindex="-1"></a>  <span className="fu">geom_density2d</span>(<span className="at">color =</span> <span className="st">"snow2"</span>, <span className="at">linewidth =</span> .<span className="dv">2</span>)</span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_log10_2D_histogram.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_log10_2D_histogram.png" data-fallback="graphics_blog/housing_skew_log10_2D_histogram.png" />
  </picture>

</div>
</div>
</div>
<p>As another example, the scatterplot below shows the increase of GDP and human life expectancy from the 1800s to 2015. It illustrates the use of a logarithmic axis transformation to unfold an otherwise clustered dataset. Note that while the x-axis is on a log10 scale, the labels show the original data values.</p>
<div className="quarto-figure quarto-figure-center">
<div className="mg-t-20 mg-b-20">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/X-trans.webp" />
    <img className="wideFig img-fluid figure-img" src="graphics_blog/X-trans.png" data-fallback="graphics_blog/X-trans.png" />
  </picture>
<figcaption>See R source code <Link to="/R/gallery/ggplot2-line-plot-gapminder"><strong>here</strong></Link>.</figcaption>
</div>
</div>
<p>Note that <strong>not all mathematical transformations change the distribution profile</strong>. For instance, the commonly used <strong>standardization into z-scores</strong> (centering by the mean and scaling by the standard deviation) and <strong>normalization into the range [0, 1]</strong> (shifting by the minimum and scaling by the range between the max and min) are linear rescalings, so they <strong>do NOT change the distribution profile</strong>. These methods are <strong>NOT effective in visualizing skewed data</strong> (though they are important in multivariate modeling and related visualizations such as principal component analysis).</p>
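<p>A quick way to convince yourself is to compare the sample skewness before and after each transformation. The base-R sketch below uses simulated log-normal data as a stand-in for a skewed variable; the helper <code>skew</code> is a simple moment-based estimate written for this illustration, not a function from any package.</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span>set.seed(42)</span>
<span>x &lt;- rlnorm(1000)  <span className="co"># simulated right-skewed data</span></span>
<span></span><br/>
<span><span className="co"># sample skewness: third central moment over the cubed standard deviation</span></span>
<span>skew &lt;- function(v) mean((v - mean(v))^3) / sd(v)^3</span>
<span></span><br/>
<span>z &lt;- (x - mean(x)) / sd(x)  <span className="co"># z-score standardization</span></span>
<span>mm &lt;- (x - min(x)) / (max(x) - min(x))  <span className="co"># min-max normalization to [0, 1]</span></span>
<span></span><br/>
<span><span className="co"># skewness of the raw, standardized, normalized, and log-transformed data</span></span>
<span>round(c(raw = skew(x), z = skew(z), minmax = skew(mm), log = skew(log(x))), 3)</span></code></pre></div>
</details>
</div>
<p>The first three values should be identical, since shifting and rescaling leave the shape untouched, while the log-transformed value should be close to zero.</p>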
</section>
<section className="level3" id="zoom-in-over-the-clustered-data-region">
<h3 className="anchored" data-anchor-id="zoom-in-over-the-clustered-data-region">Zoom-in Over the Clustered Data Region</h3>
<p>It’s a good idea to <strong>zoom in over clustered regions of interest for better visual display</strong>. (In R, this can be easily done with the <code>facet_zoom()</code> function from the <a href="https://ggforce.data-imaginist.com/index.html">ggforce</a> package.)</p>
<div className="cell" data-layout-align="center">
<details className="code-fold">
<summary>Show the code</summary>
<div className="sourceCode cell-code" id="cb9"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb9-1"><a aria-hidden="true" href="#cb9-1" tabindex="-1"></a><span className="fu">library</span>(ggforce)</span>
<span id="cb9-2"><a aria-hidden="true" href="#cb9-2" tabindex="-1"></a></span><br/>
<span id="cb9-3"><a aria-hidden="true" href="#cb9-3" tabindex="-1"></a>p3 <span className="sc">+</span> </span>
<span id="cb9-4"><a aria-hidden="true" href="#cb9-4" tabindex="-1"></a>  <span className="co"># reset theme to black-white</span></span>
<span id="cb9-5"><a aria-hidden="true" href="#cb9-5" tabindex="-1"></a>  <span className="fu">theme_bw</span>() <span className="sc">+</span> </span>
<span id="cb9-6"><a aria-hidden="true" href="#cb9-6" tabindex="-1"></a>  <span className="co"># zoom in over specified axial ranges </span></span>
<span id="cb9-7"><a aria-hidden="true" href="#cb9-7" tabindex="-1"></a>  <span className="fu">facet_zoom</span>(</span>
<span id="cb9-8"><a aria-hidden="true" href="#cb9-8" tabindex="-1"></a>    <span className="at">xlim =</span> <span className="fu">c</span>(<span className="dv">0</span>, <span className="dv">500</span>), </span>
<span id="cb9-9"><a aria-hidden="true" href="#cb9-9" tabindex="-1"></a>    <span className="at">ylim =</span> <span className="fu">c</span>(<span className="dv">1</span> <span className="sc">*</span> <span className="dv">10</span><span className="sc">^</span><span className="dv">5</span>, <span className="fl">1.3</span> <span className="sc">*</span> <span className="dv">10</span><span className="sc">^</span><span className="dv">5</span>),</span>
<span id="cb9-10"><a aria-hidden="true" href="#cb9-10" tabindex="-1"></a>    <span className="at">split =</span> <span className="cn">TRUE</span>,  <span className="co"># show separate zoom-ins over x, over y, and over both axes</span></span>
<span id="cb9-11"><a aria-hidden="true" href="#cb9-11" tabindex="-1"></a>    <span className="at">zoom.size =</span> <span className="dv">1</span>) <span className="co"># same size as the main plot </span></span></code></pre></div>
</details>
<div className="cell-output-display">
<div className="quarto-figure quarto-figure-center">
  <picture>
    <source type="image/webp" srcset="https://s3.amazonaws.com/databrewer/media/graphics_blog/housing_skew_ggforce.webp" />
    <img className="img-fluid figure-img" src="graphics_blog/housing_skew_ggforce.png" data-fallback="graphics_blog/housing_skew_ggforce.png" />
  </picture>

</div>
</div>
</div>
</section>
</main>
</div>
</div>
)}