import React from 'react'; 
import {Link} from 'react-router-dom'; 
import {useSparkCustomEffect} from '../../useCustomEffect'; 
export default function PythonOutput(){
useSparkCustomEffect()
return ( <div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<h1 id="Create-Schema-for-PySpark-DataFrame">Create Schema for PySpark DataFrame<a class="anchor-link" href="#Create-Schema-for-PySpark-DataFrame">¶</a></h1><p>A schema defines the structure of a DataFrame, specifying the names and data types of its columns. It is essential for ensuring data consistency and integrity throughout DataFrame operations, such as creation, reading, and writing.</p>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<p>This article introduces two ways to define a DataFrame's schema:<br/><br/>
<a href="#Programmatic-Schema-Creation">1. Programmatic Schema Creation (Using StructType and StructField)</a></p>
<ul>
<li>Flexibility: You can specify the data types and nullable settings for each column individually.  </li>
<li>Readability: Provides a clear definition of the schema in your code.  </li>
<li>Type Safety: PySpark verifies the data against the specified schema when the DataFrame is created.  </li>
</ul>
<p><a href="#String-Based-Schema-Definition">2. String-Based Schema Definition</a></p>
<ul>
<li>Conciseness: Easier to define a schema with a shorter syntax, especially for simple cases.  </li>
<li>Less Control: Column metadata and non-default nullable settings are harder to express and to read than with <code>StructField</code> objects.  </li>
<li>Potential Errors: The string is parsed only at runtime, so a typo in a column definition surfaces as a parse error when the DataFrame is created. </li>
</ul>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<h3 id="Create-a-Spark-Session">Create a Spark Session<a class="anchor-link" href="#Create-a-Spark-Session">¶</a></h3><p>To start, initialize a SparkSession and assign it to a variable named <code>spark</code>.</p>
<p><em>If you're not familiar with Spark session, click <Link to="../park-session">here</Link> to learn more.</em></p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [5]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight hl-ipython3"><pre className='demo-highlight python'><code className='sourceCode'><span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span></span>


<br /><span><span class="c1"># Initialize Spark Session</span></span>
<span><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">"app"</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span></span>
</code></pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<h3 id="Programmatic-Schema-Creation">Programmatic Schema Creation<a class="anchor-link" href="#Programmatic-Schema-Creation">¶</a></h3><p>Create a <code>StructType</code> object with a list of <code>StructField</code> objects. Each <code>StructField</code> object defines the column name, column type, and whether it allows null values.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [16]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight hl-ipython3"><pre className='demo-highlight python'><code className='sourceCode'><span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">StructType</span><span class="p">,</span> <span class="n">StructField</span><span class="p">,</span> <span class="n">StringType</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">,</span> <span class="n">LongType</span><span class="p">,</span> <span class="n">DoubleType</span><span class="p">,</span> <span class="n">DateType</span><span class="p">,</span> <span class="n">TimestampType</span></span>

<span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">date</span></span>

<br /><span><span class="c1"># Sample data (list of tuples)</span></span>
<span><span class="n">data</span> <span class="o">=</span> <span class="p">[</span></span>
<span>    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="s1">'string1'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span></span>
<span>    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="s1">'string2'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span></span>
<span>    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mf">4.</span><span class="p">,</span> <span class="s1">'string3'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span></span>
<span><span class="p">]</span></span>

<br /><span><span class="c1"># Define a schema variable</span></span>
<span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span></span>
<span>    <span class="n">StructField</span><span class="p">(</span><span class="s2">"col1"</span><span class="p">,</span> <span class="n">LongType</span><span class="p">(),</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">True</span><span class="p">),</span></span>
<span>    <span class="n">StructField</span><span class="p">(</span><span class="s2">"col2"</span><span class="p">,</span> <span class="n">DoubleType</span><span class="p">(),</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">True</span><span class="p">),</span></span>
<span>    <span class="n">StructField</span><span class="p">(</span><span class="s2">"col3"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">True</span><span class="p">),</span></span>
<span>    <span class="n">StructField</span><span class="p">(</span><span class="s2">"col4"</span><span class="p">,</span> <span class="n">DateType</span><span class="p">(),</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">True</span><span class="p">),</span></span>
<span>    <span class="n">StructField</span><span class="p">(</span><span class="s2">"col5"</span><span class="p">,</span> <span class="n">TimestampType</span><span class="p">(),</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span></span>
<span><span class="p">])</span></span>
</code></pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<ul>
<li><code>StructType([])</code>: creates a StructType object with a list of StructField objects.</li>
<li><code>StructField(name, type, nullable)</code> takes three main arguments:<ul>
<li><code>name</code>: the column name.</li>
<li><code>type</code>: the column's data type. The example uses five data types: <code>LongType</code>, <code>DoubleType</code>, <code>StringType</code>, <code>DateType</code>, and <code>TimestampType</code>. </li>
<li><code>nullable</code>: a boolean indicating whether the column allows null values. It defaults to <code>True</code>.</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<p>Pass the schema to the <code>schema</code> parameter of the <code>createDataFrame()</code> method:</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [17]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight hl-ipython3"><pre className='demo-highlight python'><code className='sourceCode'><span><span class="c1"># Pass the schema variable as an argument to the schema parameter</span></span>

<span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">)</span></span>
<span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span></span>
</code></pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre className='demo-highlight python'><code className='sourceCode'><span>+----+----+-------+----------+-------------------+
<br />|col1|col2|   col3|      col4|               col5|
<br />+----+----+-------+----------+-------------------+
<br />|   1| 2.0|string1|2000-01-01|2000-01-01 12:00:00|
<br />|   2| 3.0|string2|2000-02-01|2000-01-02 12:00:00|
<br />|   3| 4.0|string3|2000-03-01|2000-01-03 12:00:00|
<br />+----+----+-------+----------+-------------------+
<br /></span></code></pre>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<h3 id="String-Based-Schema-Definition">String-Based Schema Definition<a class="anchor-link" href="#String-Based-Schema-Definition">¶</a></h3><p>A string-based schema is a single string of comma-separated <code>name type</code> pairs. It is convenient when the DataFrame has a simple structure.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [18]:</div>
<div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight hl-ipython3"><pre className='demo-highlight python'><code className='sourceCode'><span><span class="c1"># Sample data (list of tuples)</span></span>

<span><span class="n">data</span> <span class="o">=</span> <span class="p">[</span></span>
<span>    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="s1">'string1'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span></span>
<span>    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="s1">'string2'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">)),</span></span>
<span>    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mf">4.</span><span class="p">,</span> <span class="s1">'string3'</span><span class="p">,</span> <span class="n">date</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span></span>
<span><span class="p">]</span></span>

<br /><span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="s1">'col1 long, col2 double, col3 string, col4 date, col5 timestamp'</span><span class="p">)</span></span>
<span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span></span>
</code></pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre className='demo-highlight python'><code className='sourceCode'><span>+----+----+-------+----------+-------------------+
<br />|col1|col2|   col3|      col4|               col5|
<br />+----+----+-------+----------+-------------------+
<br />|   1| 2.0|string1|2000-01-01|2000-01-01 12:00:00|
<br />|   2| 3.0|string2|2000-02-01|2000-01-02 12:00:00|
<br />|   3| 4.0|string3|2000-03-01|2000-01-03 12:00:00|
<br />+----+----+-------+----------+-------------------+
<br /></span></code></pre>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown">
<h4 id="Choosing-between-these-methods-depends-on-your-specific-use-case">Choosing between these methods depends on your specific use case<a class="anchor-link" href="#Choosing-between-these-methods-depends-on-your-specific-use-case">¶</a></h4><p><strong>Use StructType and StructField:</strong></p>
<ul>
<li>When you need precise control over the schema, including data types and nullable settings. </li>
<li>For more complex schemas, such as those with nested columns. </li>
</ul>
<p><strong>Use String-Based Schema:</strong></p>
<ul>
<li>For simple schemas with basic data types, where the default nullable setting (<code>True</code>) is acceptable. </li>
<li>When you prioritize concise code over a detailed schema definition.</li>
</ul>
</div>
</div>
</div>
</div>
</div>
)}