import React from 'react'; 
import {Link} from 'react-router-dom'; 
import {useRCustomEffect} from '../../../useCustomEffect'; 
export default function StrMatch(){
useRCustomEffect()
return ( <div>
<div className="page-columns page-rows-contents page-layout-article" id="quarto-content">
<main className="content" id="quarto-document-content">
<header className="quarto-title-block default" id="title-block-header">
<div className="quarto-title">
<h1 className="title">Match and Extract Capture Groups</h1>
</div>
<div className="quarto-title-meta">
</div>
</header>
<p>A capture group is a part of a regular expression that is enclosed in parentheses <code>()</code>. <code>str_match()</code> extracts the text matched by each capture group, and returns the full match together with the captured components as a character matrix.</p>
<p>Below we’ll illustrate capture groups and <code>str_match()</code> through three examples covering the following four aspects:</p>
<ul>
<li><a href="#basics">Basics of capture group with <code>str_match()</code> (e.g.1.)</a></li>
<li><a href="#named_capture_groups">Named capture groups (e.g.1.)</a></li>
<li><a href="#multiple_match">Extract components with multiple matches with <code>str_match_all()</code> (e.g.2.)</a></li>
<li><a href="#capture_group_tibble">Capture group with tibbles (e.g.3.)</a></li>
</ul>
<hr/>
<section className="level3" id="basics-of-capture-group-with-str_match">
<h3 className="anchored" data-anchor-id="basics-of-capture-group-with-str_match"><span id="basics">Basics of capture group with <code>str_match()</code></span></h3>
<p><strong><em>e.g. 1.</em></strong> Let’s first extract the month-day-year <em>without</em> using capture groups.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb1"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb1-1"><a aria-hidden="true" href="#cb1-1" tabindex="-1"></a><span className="fu">library</span>(stringr)</span>
<span id="cb1-2"><a aria-hidden="true" href="#cb1-2" tabindex="-1"></a>dates <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"Swedan, 12-22-1862"</span>,</span>
<span id="cb1-3"><a aria-hidden="true" href="#cb1-3" tabindex="-1"></a>           <span className="st">"Norway, 5-31-1864"</span>,</span>
<span id="cb1-4"><a aria-hidden="true" href="#cb1-4" tabindex="-1"></a>           <span className="st">"France, 12-6-1901"</span>)</span>
<span id="cb1-5"><a aria-hidden="true" href="#cb1-5" tabindex="-1"></a></span><br/>
<span id="cb1-6"><a aria-hidden="true" href="#cb1-6" tabindex="-1"></a><span className="co"># define the pattern of a month-day-year</span></span>
<span id="cb1-7"><a aria-hidden="true" href="#cb1-7" tabindex="-1"></a>p1 <span className="ot">&lt;-</span> <span className="st">"[0-9]&#123;1,2&#125;-[0-9]&#123;1,2&#125;-[0-9]&#123;4&#125;"</span></span></code></pre></div>
</div>
<div id="flex">
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb2"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb2-1"><a aria-hidden="true" href="#cb2-1" tabindex="-1"></a><span className="fu">str_view_all</span>(dates, p1)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">[1] │ Swedan, &lt;12-22-1862&gt;
<br/>[2] │ Norway, &lt;5-31-1864&gt;
<br/>[3] │ France, &lt;12-6-1901&gt;</code></pre>
</div>
</div>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb4"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb4-1"><a aria-hidden="true" href="#cb4-1" tabindex="-1"></a><span className="fu">str_extract</span>(dates, p1)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">[1] "12-22-1862" "5-31-1864"  "12-6-1901" </code></pre>
</div>
</div>
</div>
<p>To capture the month, day, and year each as a separate component, wrap each corresponding part of the regular expression in its own pair of parentheses. <code>str_match()</code> returns a character matrix: the first column holds the entire month-day-year match, and the remaining columns hold the captured components in the order of their parentheses.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb6"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb6-1"><a aria-hidden="true" href="#cb6-1" tabindex="-1"></a>p2 <span className="ot">&lt;-</span> <span className="st">"([0-9]&#123;1,2&#125;)-([0-9]&#123;1,2&#125;)-([0-9]&#123;4&#125;)"</span></span>
<span id="cb6-2"><a aria-hidden="true" href="#cb6-2" tabindex="-1"></a><span className="fu">str_match</span>(dates, p2)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">     [,1]         [,2] [,3] [,4]  
<br/>[1,] "12-22-1862" "12" "22" "1862"
<br/>[2,] "5-31-1864"  "5"  "31" "1864"
<br/>[3,] "12-6-1901"  "12" "6"  "1901"</code></pre>
</div>
</div>
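<p>Because the result is an ordinary character matrix, a single component can be pulled out by column index. The following is a minimal sketch, assuming a shortened copy of the dates (the vector <code>dates.demo</code> and its non-matching third string are hypothetical); elements without a match come back as <code>NA</code> in every column.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb7"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb7-1"><a aria-hidden="true" href="#cb7-1" tabindex="-1"></a><span className="fu">library</span>(stringr)</span>
<span id="cb7-2"><a aria-hidden="true" href="#cb7-2" tabindex="-1"></a><span className="co"># hypothetical demo vector; the third string contains no date</span></span>
<span id="cb7-3"><a aria-hidden="true" href="#cb7-3" tabindex="-1"></a>dates.demo <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"Swedan, 12-22-1862"</span>, <span className="st">"Norway, 5-31-1864"</span>, <span className="st">"no date here"</span>)</span>
<span id="cb7-4"><a aria-hidden="true" href="#cb7-4" tabindex="-1"></a>p2 <span className="ot">&lt;-</span> <span className="st">"([0-9]&#123;1,2&#125;)-([0-9]&#123;1,2&#125;)-([0-9]&#123;4&#125;)"</span></span>
<span id="cb7-5"><a aria-hidden="true" href="#cb7-5" tabindex="-1"></a>m <span className="ot">&lt;-</span> <span className="fu">str_match</span>(dates.demo, p2)</span>
<span id="cb7-6"><a aria-hidden="true" href="#cb7-6" tabindex="-1"></a><span className="co"># column 4 holds the third capture group (the year): "1862" "1864" NA</span></span>
<span id="cb7-7"><a aria-hidden="true" href="#cb7-7" tabindex="-1"></a>m[, <span className="dv">4</span>]</span>
<span id="cb7-8"><a aria-hidden="true" href="#cb7-8" tabindex="-1"></a><span className="co"># captured text is always character; convert it when numbers are needed</span></span>
<span id="cb7-9"><a aria-hidden="true" href="#cb7-9" tabindex="-1"></a>years <span className="ot">&lt;-</span> <span className="fu">as.integer</span>(m[, <span className="dv">4</span>])</span></code></pre></div>
</div>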
</section>
<section className="level3" id="named-capture-groups">
<h3 className="anchored" data-anchor-id="named-capture-groups"><span id="named_capture_groups">Named capture groups</span></h3>
<p>To create named capture groups, add <code>?&lt;name&gt;</code> before the matching pattern within the pair of parentheses, i.e., using the syntax <code>(?&lt;name&gt;pattern)</code>.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb8"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb8-1"><a aria-hidden="true" href="#cb8-1" tabindex="-1"></a>p2.named <span className="ot">&lt;-</span> <span className="st">"(?&lt;MON&gt;[0-9]&#123;1,2&#125;)-(?&lt;DAY&gt;[0-9]&#123;1,2&#125;)-(?&lt;year&gt;[0-9]&#123;4&#125;)"</span></span>
<span id="cb8-2"><a aria-hidden="true" href="#cb8-2" tabindex="-1"></a><span className="fu">str_match</span>(dates, p2.named)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">                  MON  DAY  year  
<br/>[1,] "12-22-1862" "12" "22" "1862"
<br/>[2,] "5-31-1864"  "5"  "31" "1864"
<br/>[3,] "12-6-1901"  "12" "6"  "1901"</code></pre>
</div>
</div>
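<p>With named groups, a component can be selected by name instead of by position. This is a small sketch, assuming a stringr version that carries the group names into the column names (as the output above shows); the object name <code>m.named</code> is only illustrative.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb9"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb9-1"><a aria-hidden="true" href="#cb9-1" tabindex="-1"></a><span className="fu">library</span>(stringr)</span>
<span id="cb9-2"><a aria-hidden="true" href="#cb9-2" tabindex="-1"></a>dates <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"Swedan, 12-22-1862"</span>, <span className="st">"Norway, 5-31-1864"</span>, <span className="st">"France, 12-6-1901"</span>)</span>
<span id="cb9-3"><a aria-hidden="true" href="#cb9-3" tabindex="-1"></a>p2.named <span className="ot">&lt;-</span> <span className="st">"(?&lt;MON&gt;[0-9]&#123;1,2&#125;)-(?&lt;DAY&gt;[0-9]&#123;1,2&#125;)-(?&lt;year&gt;[0-9]&#123;4&#125;)"</span></span>
<span id="cb9-4"><a aria-hidden="true" href="#cb9-4" tabindex="-1"></a>m.named <span className="ot">&lt;-</span> <span className="fu">str_match</span>(dates, p2.named)</span>
<span id="cb9-5"><a aria-hidden="true" href="#cb9-5" tabindex="-1"></a><span className="co"># select a captured component by its group name rather than its index</span></span>
<span id="cb9-6"><a aria-hidden="true" href="#cb9-6" tabindex="-1"></a>m.named[, <span className="st">"year"</span>]</span>
<span id="cb9-7"><a aria-hidden="true" href="#cb9-7" tabindex="-1"></a><span className="co"># the whole match has no group name, so it is still reached by position</span></span>
<span id="cb9-8"><a aria-hidden="true" href="#cb9-8" tabindex="-1"></a>m.named[, <span className="dv">1</span>]</span></code></pre></div>
</div>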
</section>
<section className="level3" id="extract-components-with-multiple-matches">
<h3 className="anchored" data-anchor-id="extract-components-with-multiple-matches"><span id="multiple_match">Extract components with multiple matches</span></h3>
<p><strong><em>e.g. 2.</em></strong> The following example extracts the components of phone numbers whose parts are separated by hyphens, white spaces, or dots (matched with <code>[- .]</code>). <code>[0-9]&#123;3&#125;</code> matches the three-digit area code and the three-digit exchange code, i.e., “xxx”, and <code>[0-9]&#123;4&#125;</code> matches the four-digit line number, “xxxx”. Each of these three parts is wrapped in parentheses as a capture group, so each is extracted and displayed in its own column in the output.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb10"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb10-1"><a aria-hidden="true" href="#cb10-1" tabindex="-1"></a>tel <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"329-293-8753 "</span>,</span>
<span id="cb10-2"><a aria-hidden="true" href="#cb10-2" tabindex="-1"></a>         <span className="st">"239 923 8115 and 842 566 4692"</span>, </span>
<span id="cb10-3"><a aria-hidden="true" href="#cb10-3" tabindex="-1"></a>         <span className="st">"Work: 579-499-7527"</span>, </span>
<span id="cb10-4"><a aria-hidden="true" href="#cb10-4" tabindex="-1"></a>         <span className="st">"Home: 543.355.3679"</span>)</span>
<span id="cb10-5"><a aria-hidden="true" href="#cb10-5" tabindex="-1"></a></span><br/>
<span id="cb10-6"><a aria-hidden="true" href="#cb10-6" tabindex="-1"></a><span className="co"># define the pattern of a phone number</span></span>
<span id="cb10-7"><a aria-hidden="true" href="#cb10-7" tabindex="-1"></a>p <span className="ot">&lt;-</span> <span className="st">"([0-9]&#123;3&#125;)[- .]([0-9]&#123;3&#125;)[- .]([0-9]&#123;4&#125;)"</span></span>
<span id="cb10-8"><a aria-hidden="true" href="#cb10-8" tabindex="-1"></a></span><br/>
<span id="cb10-9"><a aria-hidden="true" href="#cb10-9" tabindex="-1"></a><span className="fu">str_match</span>(tel, p)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">     [,1]           [,2]  [,3]  [,4]  
<br/>[1,] "329-293-8753" "329" "293" "8753"
<br/>[2,] "239 923 8115" "239" "923" "8115"
<br/>[3,] "579-499-7527" "579" "499" "7527"
<br/>[4,] "543.355.3679" "543" "355" "3679"</code></pre>
</div>
</div>
<p><code>str_match()</code> returns only the first match in each vector element and drops any later matches (similar to <Link to="/R/data-wrangling/stringr/12-str-extract"><code>str_extract()</code></Link>); e.g., the second phone number in the second string element is not returned. Use <code>str_match_all()</code> instead to extract all matches; it returns a list with one matrix per vector element (similar to <Link to="/R/data-wrangling/stringr/12-str-extract"><code>str_extract_all()</code></Link>).</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb12"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb12-1"><a aria-hidden="true" href="#cb12-1" tabindex="-1"></a><span className="fu">str_match_all</span>(tel, p)</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r">[[1]]
<br/>     [,1]           [,2]  [,3]  [,4]  
<br/>[1,] "329-293-8753" "329" "293" "8753"

<br/>[[2]]
<br/>     [,1]           [,2]  [,3]  [,4]  
<br/>[1,] "239 923 8115" "239" "923" "8115"
<br/>[2,] "842 566 4692" "842" "566" "4692"

<br/>[[3]]
<br/>     [,1]           [,2]  [,3]  [,4]  
<br/>[1,] "579-499-7527" "579" "499" "7527"

<br/>[[4]]
<br/>     [,1]           [,2]  [,3]  [,4]  
<br/>[1,] "543.355.3679" "543" "355" "3679"</code></pre>
</div>
</div>
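<p>When a single table of all matches is more convenient than a list, the per-element matrices can be stacked row-wise. The sketch below uses base R’s <code>do.call(rbind, ...)</code> on a shortened copy of <code>tel</code>; the object names <code>all.matches</code> and <code>stacked</code> are only illustrative. Note that stacking discards which input string each row came from.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb13"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb13-1"><a aria-hidden="true" href="#cb13-1" tabindex="-1"></a><span className="fu">library</span>(stringr)</span>
<span id="cb13-2"><a aria-hidden="true" href="#cb13-2" tabindex="-1"></a>tel <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"239 923 8115 and 842 566 4692"</span>, <span className="st">"Work: 579-499-7527"</span>)</span>
<span id="cb13-3"><a aria-hidden="true" href="#cb13-3" tabindex="-1"></a>p <span className="ot">&lt;-</span> <span className="st">"([0-9]&#123;3&#125;)[- .]([0-9]&#123;3&#125;)[- .]([0-9]&#123;4&#125;)"</span></span>
<span id="cb13-4"><a aria-hidden="true" href="#cb13-4" tabindex="-1"></a>all.matches <span className="ot">&lt;-</span> <span className="fu">str_match_all</span>(tel, p)</span>
<span id="cb13-5"><a aria-hidden="true" href="#cb13-5" tabindex="-1"></a><span className="co"># number of phone numbers found in each string: 2 1</span></span>
<span id="cb13-6"><a aria-hidden="true" href="#cb13-6" tabindex="-1"></a><span className="fu">sapply</span>(all.matches, nrow)</span>
<span id="cb13-7"><a aria-hidden="true" href="#cb13-7" tabindex="-1"></a><span className="co"># stack the list of matrices into one matrix, one row per match</span></span>
<span id="cb13-8"><a aria-hidden="true" href="#cb13-8" tabindex="-1"></a>stacked <span className="ot">&lt;-</span> <span className="fu">do.call</span>(rbind, all.matches)</span>
<span id="cb13-9"><a aria-hidden="true" href="#cb13-9" tabindex="-1"></a>stacked</span></code></pre></div>
</div>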
</section>
<section className="level3" id="capture-group-with-tibbles-matrix--and-tibble-column">
<h3 className="anchored" data-anchor-id="capture-group-with-tibbles-matrix--and-tibble-column"><span id="capture_group_tibble">Capture group with tibbles (matrix- and tibble-column)</span></h3>
<p>The tibble is the central data structure in the <Link to="https://www.tidyverse.org/">tidyverse</Link>, so capture groups are often applied to a tibble column to split it into its components.</p>
<p><strong><em>e.g. 3.</em></strong> Here we use the <code>who</code> dataset (of the <Link to="/R/data-wrangling/tidyr/introduction">tidyr</Link> package) from the World Health Organization Global Tuberculosis Report. For ease of demonstration, we’ll first do a simple data cleanup: use <Link to="/R/data-wrangling/dplyr/9-select-rows"><code>slice_max()</code></Link> to select the 20 rows with the highest case counts in males aged 15-24 (the <code>new_sp_m1524</code> column), and then use <Link to="/R/data-wrangling/tidyr/pivot-longer-part1"><code>pivot_longer()</code></Link> to transform the dataset into a tidy structure.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb14"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb14-1"><a aria-hidden="true" href="#cb14-1" tabindex="-1"></a><span className="fu">library</span>(tidyr)</span>
<span id="cb14-2"><a aria-hidden="true" href="#cb14-2" tabindex="-1"></a><span className="fu">library</span>(dplyr)</span>
<span id="cb14-3"><a aria-hidden="true" href="#cb14-3" tabindex="-1"></a></span><br/>
<span id="cb14-4"><a aria-hidden="true" href="#cb14-4" tabindex="-1"></a><span className="co"># select the 20 rows with the highest case counts in males aged 15-24</span></span>
<span id="cb14-5"><a aria-hidden="true" href="#cb14-5" tabindex="-1"></a>who.max <span className="ot">&lt;-</span> who <span className="sc">%&gt;%</span> </span>
<span id="cb14-6"><a aria-hidden="true" href="#cb14-6" tabindex="-1"></a>  <span className="fu">slice_max</span>(<span className="at">order_by =</span> new_sp_m1524, <span className="at">n =</span> <span className="dv">20</span>)</span>
<span id="cb14-7"><a aria-hidden="true" href="#cb14-7" tabindex="-1"></a></span><br/>
<span id="cb14-8"><a aria-hidden="true" href="#cb14-8" tabindex="-1"></a><span className="co"># convert to tidy structure</span></span>
<span id="cb14-9"><a aria-hidden="true" href="#cb14-9" tabindex="-1"></a>who.tidy <span className="ot">&lt;-</span> who.max <span className="sc">%&gt;%</span> </span>
<span id="cb14-10"><a aria-hidden="true" href="#cb14-10" tabindex="-1"></a>  <span className="fu">pivot_longer</span>(<span className="sc">-</span><span className="fu">c</span>(<span className="dv">1</span><span className="sc">:</span><span className="dv">4</span>), <span className="at">names_to =</span> <span className="st">"condition"</span>, <span className="at">values_to =</span> <span className="st">"count"</span>) </span>
<span id="cb14-11"><a aria-hidden="true" href="#cb14-11" tabindex="-1"></a></span><br/>
<span id="cb14-12"><a aria-hidden="true" href="#cb14-12" tabindex="-1"></a>who.tidy</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r"># A tibble: 1,120 × 6
<br/>   country iso2  iso3   year condition    count
<br/>   &lt;chr&gt;   &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;        &lt;dbl&gt;
<br/> 1 India   IN    IND    2010 new_sp_m014   4871
<br/> 2 India   IN    IND    2010 new_sp_m1524 78278
<br/> 3 India   IN    IND    2010 new_sp_m2534 82757
<br/> 4 India   IN    IND    2010 new_sp_m3544 90440
<br/> 5 India   IN    IND    2010 new_sp_m4554 81210
<br/> 6 India   IN    IND    2010 new_sp_m5564 60766
<br/> 7 India   IN    IND    2010 new_sp_m65   38442
<br/> 8 India   IN    IND    2010 new_sp_f014   8544
<br/> 9 India   IN    IND    2010 new_sp_f1524 53415
<br/>10 India   IN    IND    2010 new_sp_f2534 49425
<br/># ℹ 1,110 more rows</code></pre>
</div>
</div>
<p>The <code>condition</code> column actually consists of four pieces of information:</p>
<ul>
<li>The <code>new</code> or <code>new_</code> prefix indicates that the counts are new tuberculosis cases.</li>
<li><code>sp</code>, <code>sn</code>, <code>ep</code>, and <code>rel</code> describe the diagnosis result: <code>sp</code>, positive pulmonary smear; <code>sn</code>, negative pulmonary smear; <code>ep</code>, extra pulmonary; <code>rel</code>, relapse.</li>
<li><code>m</code> and <code>f</code> indicate the gender: <code>m</code>, male; <code>f</code>, female.</li>
<li>The digits show the age ranges: <code>014</code>, 0-14 years of age; <code>1524</code>, 15-24 years of age; <code>2534</code>, 25-34 years of age, etc.</li>
</ul>
<p>Here we use capture groups to extract these four pieces of information and split them into separate columns.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb16"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb16-1"><a aria-hidden="true" href="#cb16-1" tabindex="-1"></a>pattern <span className="ot">&lt;-</span> <span className="st">"(new)_?(.*)_(.)(.*)"</span></span>
<span id="cb16-2"><a aria-hidden="true" href="#cb16-2" tabindex="-1"></a></span><br/>
<span id="cb16-3"><a aria-hidden="true" href="#cb16-3" tabindex="-1"></a>x <span className="ot">&lt;-</span> who.tidy <span className="sc">%&gt;%</span> </span>
<span id="cb16-4"><a aria-hidden="true" href="#cb16-4" tabindex="-1"></a>  <span className="fu">mutate</span>(<span className="at">a =</span> <span className="fu">str_match</span>(condition, pattern), </span>
<span id="cb16-5"><a aria-hidden="true" href="#cb16-5" tabindex="-1"></a>         <span className="at">.keep =</span> <span className="st">"unused"</span>)</span>
<span id="cb16-6"><a aria-hidden="true" href="#cb16-6" tabindex="-1"></a>x</span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r"># A tibble: 1,120 × 6
<br/>   country iso2  iso3   year count a[,1]        [,2]  [,3]  [,4]  [,5] 
<br/>   &lt;chr&gt;   &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
<br/> 1 India   IN    IND    2010  4871 new_sp_m014  new   sp    m     014  
<br/> 2 India   IN    IND    2010 78278 new_sp_m1524 new   sp    m     1524 
<br/> 3 India   IN    IND    2010 82757 new_sp_m2534 new   sp    m     2534 
<br/> 4 India   IN    IND    2010 90440 new_sp_m3544 new   sp    m     3544 
<br/> 5 India   IN    IND    2010 81210 new_sp_m4554 new   sp    m     4554 
<br/> 6 India   IN    IND    2010 60766 new_sp_m5564 new   sp    m     5564 
<br/> 7 India   IN    IND    2010 38442 new_sp_m65   new   sp    m     65   
<br/> 8 India   IN    IND    2010  8544 new_sp_f014  new   sp    f     014  
<br/> 9 India   IN    IND    2010 53415 new_sp_f1524 new   sp    f     1524 
<br/>10 India   IN    IND    2010 49425 new_sp_f2534 new   sp    f     2534 
<br/># ℹ 1,110 more rows</code></pre>
</div>
</div>
<p>Note that the columns <code>a[,1]</code>, <code>[,2]</code>…<code>[,5]</code> are in fact <em>a single matrix-column</em>, which can be retrieved as a matrix with <code>x$a</code>. To simplify downstream analysis, we can split this matrix-column into separate columns by:</p>
<ul>
<li>first turning the matrix-column into a single tibble-column with <code>as_tibble()</code>,</li>
<li>and then unpacking this tibble-column into separate columns with <code>unpack()</code>.</li>
</ul>
<p>In addition, we use named capture groups to name the newly generated columns. The whole-match column has no group name, so it appears as <code>V1</code> in the output.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb18"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb18-1"><a aria-hidden="true" href="#cb18-1" tabindex="-1"></a><span className="co"># create a named capture group</span></span>
<span id="cb18-2"><a aria-hidden="true" href="#cb18-2" tabindex="-1"></a>named.capture <span className="ot">&lt;-</span> <span className="st">"(?&lt;status&gt;new)_?(?&lt;type&gt;.*)_(?&lt;gender&gt;.)(?&lt;age&gt;.*)"</span></span>
<span id="cb18-3"><a aria-hidden="true" href="#cb18-3" tabindex="-1"></a></span><br/>
<span id="cb18-4"><a aria-hidden="true" href="#cb18-4" tabindex="-1"></a>who.tidy <span className="sc">%&gt;%</span> </span>
<span id="cb18-5"><a aria-hidden="true" href="#cb18-5" tabindex="-1"></a>  <span className="fu">mutate</span>(</span>
<span id="cb18-6"><a aria-hidden="true" href="#cb18-6" tabindex="-1"></a>    <span className="co"># capture, and return as a tibble</span></span>
<span id="cb18-7"><a aria-hidden="true" href="#cb18-7" tabindex="-1"></a>    <span className="at">a =</span> <span className="fu">str_match</span>(condition, named.capture) <span className="sc">%&gt;%</span> <span className="fu">as_tibble</span>(), </span>
<span id="cb18-8"><a aria-hidden="true" href="#cb18-8" tabindex="-1"></a>    <span className="at">.keep =</span> <span className="st">"unused"</span>) <span className="sc">%&gt;%</span> </span>
<span id="cb18-9"><a aria-hidden="true" href="#cb18-9" tabindex="-1"></a>  <span className="co"># unpack the single tibble-column into separate columns</span></span>
<span id="cb18-10"><a aria-hidden="true" href="#cb18-10" tabindex="-1"></a>  <span className="fu">unpack</span>(a) </span></code></pre></div>
<div className="cell-output cell-output-stdout">
<pre className="demo-highlight sourceCode r rcss"><code className="sourceCode r"># A tibble: 1,120 × 10
<br/>   country iso2  iso3   year count V1           status type  gender age  
<br/>   &lt;chr&gt;   &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;        &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;  &lt;chr&gt;
<br/> 1 India   IN    IND    2010  4871 new_sp_m014  new    sp    m      014  
<br/> 2 India   IN    IND    2010 78278 new_sp_m1524 new    sp    m      1524 
<br/> 3 India   IN    IND    2010 82757 new_sp_m2534 new    sp    m      2534 
<br/> 4 India   IN    IND    2010 90440 new_sp_m3544 new    sp    m      3544 
<br/> 5 India   IN    IND    2010 81210 new_sp_m4554 new    sp    m      4554 
<br/> 6 India   IN    IND    2010 60766 new_sp_m5564 new    sp    m      5564 
<br/> 7 India   IN    IND    2010 38442 new_sp_m65   new    sp    m      65   
<br/> 8 India   IN    IND    2010  8544 new_sp_f014  new    sp    f      014  
<br/> 9 India   IN    IND    2010 53415 new_sp_f1524 new    sp    f      1524 
<br/>10 India   IN    IND    2010 49425 new_sp_f2534 new    sp    f      2534 
<br/># ℹ 1,110 more rows</code></pre>
</div>
</div>
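<p>As a quick check of how the pattern copes with both prefix forms, it can be run on a couple of codes directly. This is a sketch on two hand-written strings (the vector <code>codes</code> is hypothetical and not taken from <code>who.tidy</code>): the optional <code>_?</code> lets <code>new_sp</code> and <code>newrel</code> both resolve to the same four components.</p>
<div className="cell" data-layout-align="center">
<div className="sourceCode cell-code" id="cb19"><pre className="sourceCode r code-with-copy"><code className="sourceCode r"><span id="cb19-1"><a aria-hidden="true" href="#cb19-1" tabindex="-1"></a><span className="fu">library</span>(stringr)</span>
<span id="cb19-2"><a aria-hidden="true" href="#cb19-2" tabindex="-1"></a>named.capture <span className="ot">&lt;-</span> <span className="st">"(?&lt;status&gt;new)_?(?&lt;type&gt;.*)_(?&lt;gender&gt;.)(?&lt;age&gt;.*)"</span></span>
<span id="cb19-3"><a aria-hidden="true" href="#cb19-3" tabindex="-1"></a><span className="co"># hypothetical test codes: one with the "new_" prefix, one with "new"</span></span>
<span id="cb19-4"><a aria-hidden="true" href="#cb19-4" tabindex="-1"></a>codes <span className="ot">&lt;-</span> <span className="fu">c</span>(<span className="st">"new_sp_m014"</span>, <span className="st">"newrel_f65"</span>)</span>
<span id="cb19-5"><a aria-hidden="true" href="#cb19-5" tabindex="-1"></a>m <span className="ot">&lt;-</span> <span className="fu">str_match</span>(codes, named.capture)</span>
<span id="cb19-6"><a aria-hidden="true" href="#cb19-6" tabindex="-1"></a><span className="co"># row 1: "new_sp_m014" "new" "sp"  "m" "014"</span></span>
<span id="cb19-7"><a aria-hidden="true" href="#cb19-7" tabindex="-1"></a><span className="co"># row 2: "newrel_f65"  "new" "rel" "f" "65"</span></span>
<span id="cb19-8"><a aria-hidden="true" href="#cb19-8" tabindex="-1"></a>m</span></code></pre></div>
</div>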
</section>
</main>
</div>
</div>
)}