Aggregating Podcast Episodes

Download podcast episodes from web pages, parse and clean the data, and finally export it to Markdown.

Learning Objectives

You will learn how to download web pages with Python and the requests library and how to parse HTML content with BeautifulSoup. You will also learn how to handle errors in your code and how to read and write files efficiently. In addition, you will clean and process data, create and manipulate DataFrames with Pandas, and export data to the Markdown format.

Requesting

This cell defines a Python script that downloads the page https://fyyd.de/search?page=0&search=digitalisierung and saves it as an HTML file. It uses the requests library for the download and handles any errors that occur during the process.

# prompt: please create Python code that downloads the page https://fyyd.de/search?page=0&search=digitalisierung and saves it as HTML

import requests

def download_and_save_html(url, filename="downloaded_page.html"):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes

        with open(filename, "w", encoding="utf-8") as file:
            file.write(response.text)
        print(f"Page downloaded and saved as {filename}")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    url = "https://fyyd.de/search?page=0&search=digitalisierung"
    download_and_save_html(url)
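
The generated code works, but real-world scrapers usually set a request timeout and a User-Agent header, so a stalled connection cannot hang the script and the server sees an identifiable client. A minimal variant (the header value and the 10-second timeout are illustrative choices, not part of the original notebook):

import requests

def download_politely(url, filename="downloaded_page.html"):
    # Illustrative header; some sites reject requests without a User-Agent
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraping-tutorial)"}
    try:
        # timeout prevents the request from hanging indefinitely
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        with open(filename, "w", encoding="utf-8") as file:
            file.write(response.text)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")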

Scraping

This cell defines a Python script that opens the file downloaded_page.html and extracts all links from it. It uses the BeautifulSoup library for HTML parsing and stores the extracted links in a list.

# prompt: write Python code that opens the file downloaded_page.html, extracts all links, and stores them in a variable

from bs4 import BeautifulSoup

def extract_links(filename="downloaded_page.html"):
    try:
        with open(filename, "r", encoding="utf-8") as file:
            html_content = file.read()

        soup = BeautifulSoup(html_content, "html.parser")
        links = []
        for link in soup.find_all("a", href=True):
            links.append(link["href"])
        return links
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

if __name__ == "__main__":
    extracted_links = extract_links()
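
The extracted href values are a mix of relative paths and absolute URLs. If you want absolute URLs right away, the standard library's urljoin can resolve each href against the page URL. A small sketch (base_url and absolute_links are names introduced here for illustration):

from urllib.parse import urljoin

base_url = "https://fyyd.de/search?page=0&search=digitalisierung"
# Resolve relative hrefs against the page URL; absolute hrefs pass through unchanged
absolute_links = [urljoin(base_url, href) for href in (extracted_links or [])]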

Cleaning

This cell builds a list episode_links containing all links from the variable extracted_links that include the string "/episode/". Each link is checked to be a string containing the pattern "/episode/" before it is added to the list.

# prompt: write Python code that collects, in a list, all strings from the variable extracted_links that contain /episode/

episode_links = []
if extracted_links:
    for link in extracted_links:
        if isinstance(link, str) and "/episode/" in link:
            episode_links.append(link)
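
The same filter can be written as a single list comprehension, which is the more idiomatic Python form:

# Equivalent filter as a list comprehension
episode_links = [link for link in (extracted_links or [])
                 if isinstance(link, str) and "/episode/" in link]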

This cell builds a list cleaned_episode_links that converts every link in episode_links into the format https://fyyd.de/episode/xxxxxxxxx/. Links that start with /episode/ are prefixed with the base URL, and links containing /transcript are trimmed to match the desired format.

# prompt: clean the strings in episode_links so that '/episode/13289098',  '/episode/4591809',  '/episode/10661740',  'https://fyyd.de/episode/10661740/transcript#t4514',  '/episode/13287710',  '/episode/13289210' become strings in the following format 'https://fyyd.de/episode/xxxxxxxxx/'

cleaned_episode_links = []
for link in episode_links:
    if link.startswith('/episode/'):
        cleaned_episode_links.append(f"https://fyyd.de{link}")
    elif "https://fyyd.de/episode/" in link:
        cleaned_episode_links.append(link.split("/transcript")[0] + "/") # handles transcript links
    else:
        cleaned_episode_links.append(link) # keep links that are already in correct format

len(cleaned_episode_links)
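
An alternative way to normalize the links is to pull the numeric episode ID out of each string with a regular expression and rebuild every URL in canonical form, which sidesteps the case distinctions above. A sketch, assuming the IDs are always numeric (as in the examples):

import re

canonical_links = []
for link in episode_links:
    match = re.search(r"/episode/(\d+)", link)  # capture the numeric episode ID
    if match:
        canonical_links.append(f"https://fyyd.de/episode/{match.group(1)}/")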

This cell converts the list cleaned_episode_links into a set to remove duplicate entries and then back into a list. The length is then computed to determine the number of unique links in cleaned_episode_links.

cleaned_episode_links = list(set(cleaned_episode_links))

len(cleaned_episode_links)
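
Note that list(set(...)) removes duplicates but does not preserve the original order of the links. If the order from the search page matters, dict.fromkeys deduplicates while keeping the first occurrence of each entry:

# Order-preserving alternative to list(set(...)): keeps first occurrences
cleaned_episode_links = list(dict.fromkeys(cleaned_episode_links))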

Second Requesting

This cell creates a Python script that iterates over all links in cleaned_episode_links and saves the HTML responses in a folder. For each link, an HTTP request is sent and the response is written to a file whose name contains the episode ID and an index. The episode_html folder is created if it does not already exist, and errors during the download are caught and printed.

# prompt: iterate over all links in cleaned_episode_links and save the html responses in a folder; include the episode id in the filename

import requests
import os

# Create the directory if it doesn't exist
if not os.path.exists("episode_html"):
    os.makedirs("episode_html")

for i, link in enumerate(cleaned_episode_links):
    try:
        response = requests.get(link)
        response.raise_for_status()

        # Extract episode ID (you might need a more robust way to extract this)
        episode_id = link.split("/")[-2] if link.split("/")[-1] == "" else link.split("/")[-1] # handles trailing slashes

        filename = os.path.join("episode_html", f"episode_{episode_id}_{i}.html")  # Use episode ID in filename
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded {link} to {filename}")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {link}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred while processing {link}: {e}")

Scraping

This cell creates a Python script that iterates over all HTML files in a given directory (output_dir), sorts the filenames, and extracts specific information from the HTML contents.

The script starts by importing the necessary libraries (pandas, os, BeautifulSoup, and re). It defines a function process_html_files that processes the HTML files in the episode_html directory.

The function lists all HTML files in the directory and sorts them. For each file, the HTML content is read and parsed with BeautifulSoup.

It extracts the page title from the <title> tag and the date and time from a <span> tag whose title attribute contains the word "importiert". The episode ID is taken from the filename.

The extracted data (title, date and time, episode ID) are collected in a list and finally converted into a Pandas DataFrame. The DataFrame is returned and can be used for further processing.

# prompt: iterate over all html files in output_dir, sort the list of filenames, and extract <title>fyyd: Studio 9: Welche Chancen bringt die elektronische Patientenakte?</title> and the date and time from <span title="importiert :17.03.2023 14:25"> into a pandas dataframe as title date episode id

import pandas as pd
import os
from bs4 import BeautifulSoup
import re

def process_html_files(output_dir="episode_html"):
    """
    Iterates through HTML files, extracts title, date, and time, and creates a Pandas DataFrame.
    """
    html_files = [f for f in os.listdir(output_dir) if f.endswith(".html")]
    html_files.sort()  # Sort filenames

    data = []
    for filename in html_files:
        filepath = os.path.join(output_dir, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as file:
                html_content = file.read()

            soup = BeautifulSoup(html_content, "html.parser")

            # Extract the title using a regular expression for robustness
            title_tag = soup.find("title")
            if title_tag:
                title_match = re.search(r"<title>(.*?)</title>", str(title_tag))
                title = title_match.group(1) if title_match else None
            else:
                title = None

            # Extract date and time from the span tag's title attribute
            date_time_span = soup.find("span", attrs={"title": re.compile(r"importiert")})
            if date_time_span and date_time_span["title"]:
                date_time_str = date_time_span["title"].split(":", 1)[1]  # split at the first ":"
            else:
                date_time_str = None

            episode_id = filename.split("_")[1]  # extract the episode ID from the filename

            data.append([title, date_time_str, episode_id])

        except FileNotFoundError:
            print(f"Error: File '{filename}' not found.")
        except Exception as e:
            print(f"An unexpected error occurred while processing {filename}: {e}")

    df = pd.DataFrame(data, columns=["title", "date", "id"])
    return df

# Example usage (assuming the episode_html directory exists)
df = process_html_files()
df

Cleaning

This cell removes the string "fyyd:" from the beginning of every row in the title column of the DataFrame df. It uses the str.replace method, which replaces the regular expression ^fyyd: (matching "fyyd:" at the start of a string) with an empty string. The result is stored back in the title column of df.

# prompt: remove the "fyyd: " at the beginning of every row in df.title

# Remove "fyyd: " from the beginning of each row in df.title
df['title'] = df['title'].str.replace(r'^fyyd: ', '', regex=True)

df

Date Formatting

This cell converts the date column of the DataFrame df into datetime objects using the format %d.%m.%Y %H:%M. Invalid date strings are coerced to NaT (errors='coerce'). The DataFrame is then sorted by the date column in descending order, so the most recent entries come first. Finally, the updated DataFrame is displayed.

# prompt: Using dataframe df: format the date column and sort by date, starting with the most recent date

# Convert the 'date' column to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y %H:%M', errors='coerce')

# Sort the DataFrame by the 'date' column in descending order (most recent first)
df = df.sort_values(by='date', ascending=False)

# Display the updated DataFrame
df
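
Because errors='coerce' silently turns unparseable date strings into NaT rather than raising an error, it can be worth checking how many rows failed to parse. A small check (not part of the original notebook):

# Count rows whose date string could not be parsed (NaT after coercion)
print(f"Unparsed dates: {df['date'].isna().sum()}")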

Additional Features

This cell creates a new url column in the DataFrame df by concatenating the base URL https://fyyd.de/episode/ with the values of the id column, producing a full URL for each episode. The updated DataFrame can then be displayed to verify the changes.

# prompt: Using dataframe df: format the id as a url with the format https://fyyd.de/episode/id

# Create the URL column by concatenating the base URL with the 'id' column.
df['url'] = 'https://fyyd.de/episode/' + df['id']

# Display the updated DataFrame to verify the changes.
#print(df.head())

This cell removes duplicates from the DataFrame df using the drop_duplicates() method, which drops all rows that are identical in every column. The updated DataFrame is then displayed.

# prompt: remove duplicates from df

# Remove duplicates based on all columns
df = df.drop_duplicates()

# Display the updated DataFrame
df
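
Note that drop_duplicates() only removes rows that are identical in every column. Since the episode ID should uniquely identify an episode, deduplicating on the id column alone is a stricter alternative (a suggestion, not part of the original notebook):

# Keep only the first row for each episode ID
df = df.drop_duplicates(subset='id')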

Export as a Markdown Table

This cell resets the index of the DataFrame df and adds a new index column that starts at 1. The date column is formatted to show only the date without the time. A Markdown-formatted table is then built in which the title column is rendered as a Markdown link ([title](url)). The resulting table contains the columns index, title, and date. Finally, the Markdown table is displayed and can optionally be saved to a file.

# prompt: reset the index and show as a markdown formatted table with title as a markdown link [title](url) and date without the time; the url does not need to be listed as a separate column; add an index; the final table should have the columns index, title, date

# Reset the index and create a new 'index' column
df = df.reset_index(drop=True)
df.index = df.index + 1  # Start the index at 1
df = df.rename_axis('index')

# Format the 'date' column to only show the date without the time
df['date'] = df['date'].dt.strftime('%d.%m.%Y')

# Create the Markdown-formatted table
markdown_table = "| index | title | date |\n"
markdown_table += "|---|---|---|\n"  # Separator row

for index, row in df.iterrows():
    title_link = f"[{row['title']}]({row['url']})"
    markdown_table += f"| {index} | {title_link} | {row['date']} |\n"

# Display the markdown table
markdown_table

# You can also save the markdown table to a file if you want:
# with open("output.md", "w") as f:
#     f.write(markdown_table)

Conclusion

This notebook shows how to download podcast episodes from web pages, clean the data and bring it into a consistent format, and finally present it as a Markdown table. It provides a complete walkthrough of data extraction, processing, and formatting with Python.

Its value lies in automating and simplifying the process of turning unstructured web content into structured, clearly presented data.