Aggregating Podcast Episodes

Download podcast episodes from web pages, parse and clean the data, and finally export it to Markdown.

Learning Objectives

You will learn how to download web pages with Python and the requests library and how to parse HTML content with BeautifulSoup. You will also learn how to handle errors in your code and how to read and write files efficiently. In addition, you will clean and process data, create and manipulate DataFrames with Pandas, and export data to the Markdown format.

Requesting

This cell defines a Python script that downloads the page https://fyyd.de/search?page=0&search=digitalisierung and saves it as an HTML file. It uses the requests library for the download and handles any errors that occur during the process.

# prompt: please create Python code that downloads the page https://fyyd.de/search?page=0&search=digitalisierung and saves it as HTML

import requests

def download_and_save_html(url, filename="downloaded_page.html"):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes

        with open(filename, "w", encoding="utf-8") as file:
            file.write(response.text)
        print(f"Page downloaded and saved as {filename}")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    url = "https://fyyd.de/search?page=0&search=digitalisierung"
    download_and_save_html(url)
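
The generated code works, but real-world scrapers usually set a request timeout and a User-Agent header, so a stalled connection cannot hang the script and the server sees an identifiable client. A minimal variant (the header value and the 10-second timeout are illustrative choices, not part of the original notebook):

import requests

def download_politely(url, filename="downloaded_page.html"):
    # Illustrative header; some sites reject requests without a User-Agent
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraping-tutorial)"}
    try:
        # timeout prevents the request from hanging indefinitely
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        with open(filename, "w", encoding="utf-8") as file:
            file.write(response.text)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")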

Scraping

This cell defines a Python script that opens the file downloaded_page.html and extracts all links from it. It uses the BeautifulSoup library for HTML parsing and stores the extracted links in a list.

# prompt: write Python code that opens the file downloaded_page.html, extracts all links, and stores them in a variable

from bs4 import BeautifulSoup

def extract_links(filename="downloaded_page.html"):
    try:
        with open(filename, "r", encoding="utf-8") as file:
            html_content = file.read()

        soup = BeautifulSoup(html_content, "html.parser")
        links = []
        for link in soup.find_all("a", href=True):
            links.append(link["href"])
        return links
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

if __name__ == "__main__":
    extracted_links = extract_links()
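
The extracted href values are a mix of relative paths and absolute URLs. If you want absolute URLs right away, the standard library's urljoin can resolve each href against the page URL. A small sketch (base_url and absolute_links are names introduced here for illustration):

from urllib.parse import urljoin

base_url = "https://fyyd.de/search?page=0&search=digitalisierung"
# Resolve relative hrefs against the page URL; absolute hrefs pass through unchanged
absolute_links = [urljoin(base_url, href) for href in (extracted_links or [])]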

Cleaning

This cell builds a list episode_links containing all links from the variable extracted_links that include the string "/episode/". Each link is checked to be a string containing the pattern "/episode/" before it is added to the list.

# prompt: write Python code that collects, in a list, all strings from the variable extracted_links that contain /episode/

episode_links = []
if extracted_links:
    for link in extracted_links:
        if isinstance(link, str) and "/episode/" in link:
            episode_links.append(link)
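
The same filter can be written as a single list comprehension, which is the more idiomatic Python form:

# Equivalent filter as a list comprehension
episode_links = [link for link in (extracted_links or [])
                 if isinstance(link, str) and "/episode/" in link]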

This cell builds a list cleaned_episode_links that converts every link in episode_links into the format https://fyyd.de/episode/xxxxxxxxx/. Links that start with /episode/ are prefixed with the base URL, and links containing /transcript are trimmed to match the desired format.

# prompt: clean the strings in episode_links so that '/episode/13289098',  '/episode/4591809',  '/episode/10661740',  'https://fyyd.de/episode/10661740/transcript#t4514',  '/episode/13287710',  '/episode/13289210' become strings in the following format 'https://fyyd.de/episode/xxxxxxxxx/'

cleaned_episode_links = []
for link in episode_links:
    if link.startswith('/episode/'):
        cleaned_episode_links.append(f"https://fyyd.de{link}")
    elif "https://fyyd.de/episode/" in link:
        cleaned_episode_links.append(link.split("/transcript")[0] + "/") # handles transcript links
    else:
        cleaned_episode_links.append(link) # keep links that are already in correct format

len(cleaned_episode_links)
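
An alternative way to normalize the links is to pull the numeric episode ID out of each string with a regular expression and rebuild every URL in canonical form, which sidesteps the case distinctions above. A sketch, assuming the IDs are always numeric (as in the examples):

import re

canonical_links = []
for link in episode_links:
    match = re.search(r"/episode/(\d+)", link)  # capture the numeric episode ID
    if match:
        canonical_links.append(f"https://fyyd.de/episode/{match.group(1)}/")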

This cell converts the list cleaned_episode_links into a set to remove duplicate entries and then back into a list. The length is then computed to determine the number of unique links in cleaned_episode_links.

cleaned_episode_links = list(set(cleaned_episode_links))

len(cleaned_episode_links)
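
Note that list(set(...)) removes duplicates but does not preserve the original order of the links. If the order from the search page matters, dict.fromkeys deduplicates while keeping the first occurrence of each entry:

# Order-preserving alternative to list(set(...)): keeps first occurrences
cleaned_episode_links = list(dict.fromkeys(cleaned_episode_links))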

Second Requesting

This cell creates a Python script that iterates over all links in cleaned_episode_links and saves the HTML responses in a folder. For each link, an HTTP request is sent and the response is written to a file whose name contains the episode ID and an index. The episode_html folder is created if it does not already exist, and errors during the download are caught and printed.

# prompt: iterate over all links in cleaned_episode_links and save the html responses in a folder; include the episode id in the filename

import requests
import os

# Create the directory if it doesn't exist
if not os.path.exists("episode_html"):
    os.makedirs("episode_html")

for i, link in enumerate(cleaned_episode_links):
    try:
        response = requests.get(link)
        response.raise_for_status()

        # Extract episode ID (you might need a more robust way to extract this)
        episode_id = link.split("/")[-2] if link.split("/")[-1] == "" else link.split("/")[-1] # handles trailing slashes

        filename = os.path.join("episode_html", f"episode_{episode_id}_{i}.html")  # Use episode ID in filename
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded {link} to {filename}")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {link}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred while processing {link}: {e}")

Scraping

This cell creates a Python script that iterates over all HTML files in a given directory (output_dir), sorts the filenames, and extracts specific information from the HTML contents.

The script starts by importing the necessary libraries (pandas, os, BeautifulSoup, and re). It defines a function process_html_files that processes the HTML files in the episode_html directory.

The function lists all HTML files in the directory and sorts them. For each file, the HTML content is read and parsed with BeautifulSoup.

It extracts the page title from the <title> tag and the date and time from a <span> tag whose title attribute contains the word "importiert". The episode ID is taken from the filename.

The extracted data (title, date and time, episode ID) are collected in a list and finally converted into a Pandas DataFrame. The DataFrame is returned and can be used for further processing.

# prompt: iterate over all html files in output_dir, sort the list of filenames, and extract <title>fyyd: Studio 9: Welche Chancen bringt die elektronische Patientenakte?</title> and the date and time from <span title="importiert :17.03.2023 14:25"> into a pandas dataframe as title date episode id

import pandas as pd
import os
from bs4 import BeautifulSoup
import re

def process_html_files(output_dir="episode_html"):
    """
    Iterates through HTML files, extracts title, date, and time, and creates a Pandas DataFrame.
    """
    html_files = [f for f in os.listdir(output_dir) if f.endswith(".html")]
    html_files.sort()  # Sort filenames

    data = []
    for filename in html_files:
        filepath = os.path.join(output_dir, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as file:
                html_content = file.read()

            soup = BeautifulSoup(html_content, "html.parser")

            # Extract the title using a regular expression for robustness
            title_tag = soup.find("title")
            if title_tag:
                title_match = re.search(r"<title>(.*?)</title>", str(title_tag))
                title = title_match.group(1) if title_match else None
            else:
                title = None

            # Extract date and time from the span tag's title attribute
            date_time_span = soup.find("span", attrs={"title": re.compile(r"importiert")})
            if date_time_span and date_time_span["title"]:
                date_time_str = date_time_span["title"].split(":", 1)[1]  # split at the first ":"
            else:
                date_time_str = None

            episode_id = filename.split("_")[1]  # extract the episode ID from the filename

            data.append([title, date_time_str, episode_id])

        except FileNotFoundError:
            print(f"Error: File '{filename}' not found.")
        except Exception as e:
            print(f"An unexpected error occurred while processing {filename}: {e}")

    df = pd.DataFrame(data, columns=["title", "date", "id"])
    return df

# Example usage (assuming the episode_html directory exists)
df = process_html_files()
df

Cleaning

This cell removes the string "fyyd:" from the beginning of every row in the title column of the DataFrame df. It uses the str.replace method, which replaces the regular expression ^fyyd: (matching "fyyd:" at the start of a string) with an empty string. The result is stored back in the title column of df.

# prompt: remove the "fyyd: " at the beginning of every row in df.title

# Remove "fyyd: " from the beginning of each row in df.title
df['title'] = df['title'].str.replace(r'^fyyd: ', '', regex=True)

df

Date Formatting

This cell converts the date column of the DataFrame df into datetime objects using the format %d.%m.%Y %H:%M. Invalid date strings are coerced to NaT (errors='coerce'). The DataFrame is then sorted by the date column in descending order, so the most recent entries come first. Finally, the updated DataFrame is displayed.

# prompt: Using dataframe df: format the date column and sort by date, starting with the most recent date

# Convert the 'date' column to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y %H:%M', errors='coerce')

# Sort the DataFrame by the 'date' column in descending order (most recent first)
df = df.sort_values(by='date', ascending=False)

# Display the updated DataFrame
df
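
Because errors='coerce' silently turns unparseable date strings into NaT rather than raising an error, it can be worth checking how many rows failed to parse. A small check (not part of the original notebook):

# Count rows whose date string could not be parsed (NaT after coercion)
print(f"Unparsed dates: {df['date'].isna().sum()}")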

Additional Features

This cell creates a new url column in the DataFrame df by concatenating the base URL https://fyyd.de/episode/ with the values of the id column, producing a full URL for each episode. The updated DataFrame can then be displayed to verify the changes.

# prompt: Using dataframe df: format the id as a url with the format https://fyyd.de/episode/id

# Create the URL column by concatenating the base URL with the 'id' column.
df['url'] = 'https://fyyd.de/episode/' + df['id']

# Display the updated DataFrame to verify the changes.
#print(df.head())

This cell removes duplicates from the DataFrame df using the drop_duplicates() method, which drops all rows that are identical in every column. The updated DataFrame is then displayed.

# prompt: remove duplicates from df

# Remove duplicates based on all columns
df = df.drop_duplicates()

# Display the updated DataFrame
df
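
Note that drop_duplicates() only removes rows that are identical in every column. Since the episode ID should uniquely identify an episode, deduplicating on the id column alone is a stricter alternative (a suggestion, not part of the original notebook):

# Keep only the first row for each episode ID
df = df.drop_duplicates(subset='id')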

Export as a Markdown Table

This cell resets the index of the DataFrame df and adds a new index column that starts at 1. The date column is formatted to show only the date without the time. A Markdown-formatted table is then built in which the title column is rendered as a Markdown link ([title](url)). The resulting table contains the columns index, title, and date. Finally, the Markdown table is displayed and can optionally be saved to a file.

# prompt: reset the index and show as a markdown formatted table with title as a markdown link [title](url) and date without the time; the url does not need to be listed as a separate column; add an index; the final table should have the columns index, title, date

# Reset the index and create a new 'index' column
df = df.reset_index(drop=True)
df.index = df.index + 1  # Start the index at 1
df = df.rename_axis('index')

# Format the 'date' column to only show the date without the time
df['date'] = df['date'].dt.strftime('%d.%m.%Y')

# Create the Markdown-formatted table
markdown_table = "| index | title | date |\n"
markdown_table += "|---|---|---|\n"  # Separator row

for index, row in df.iterrows():
    title_link = f"[{row['title']}]({row['url']})"
    markdown_table += f"| {index} | {title_link} | {row['date']} |\n"

# Display the markdown table
markdown_table

# You can also save the markdown table to a file if you want:
# with open("output.md", "w") as f:
#     f.write(markdown_table)

Conclusion

This notebook shows how to download podcast episodes from web pages, clean the data and bring it into a consistent format, and finally present it as a Markdown table. It provides a complete walkthrough of data extraction, processing, and formatting with Python.

Its value lies in automating and simplifying the process of turning unstructured web content into structured, clearly presented data.