# Content Parser


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

``` python
from fastcore.test import test_eq
```

## metadata Parsing

Utilities for extracting frontmatter metadata from markdown and Jupyter
notebook files.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L23"
target="_blank" style="float:right; font-size:smaller">source</a>

### parse_metadata

``` python

def parse_metadata(
    content:str, # Raw markdown content with YAML frontmatter
)->dict:

```

*Extract metadata from YAML frontmatter.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L33"
target="_blank" style="float:right; font-size:smaller">source</a>

### parse_notebook_metadata

``` python

def parse_notebook_metadata(
    content:str, # Raw Jupyter notebook JSON string
)->dict:

```

*Extract metadata from the first cell of a Jupyter notebook.*

``` python
# Test Parse metadat
from pathlib import Path

sample_dir = Path("sample")
if not sample_dir.exists():
    sample_dir = Path("../sample")

with open(sample_dir / "example.md", "r") as file:
    content = file.read()

metadata = parse_metadata(content)

# content is .ipynb
with open(sample_dir / "design_questions.ipynb", "r") as f:
    nb_content = f.read()
nb_metadata = parse_notebook_metadata(nb_content)
print(nb_metadata)
```

    {}

## Notebook Parsing

Helpers for filtering and extracting content from Jupyter notebook
cells.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L47"
target="_blank" style="float:right; font-size:smaller">source</a>

### is_frontmatter

``` python

def is_frontmatter(
    cell:dict, # Notebook cell dict
)->bool:

```

*Check if a cell is a YAML frontmatter cell.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L55"
target="_blank" style="float:right; font-size:smaller">source</a>

### is_visible_code

``` python

def is_visible_code(
    cell:dict, # Notebook cell dict
    is_quarto:bool=False, # Whether the notebook is a Quarto doc
)->bool:

```

*Check if a code cell should be included in output.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L66"
target="_blank" style="float:right; font-size:smaller">source</a>

### extract_notebook_content

``` python

def extract_notebook_content(
    content:str, # Raw Jupyter notebook JSON string
    is_quarto:bool=False, # Whether the notebook is a Quarto doc
)->str:

```

*Extract visible markdown and code content from a Jupyter notebook.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L76"
target="_blank" style="float:right; font-size:smaller">source</a>

### remove_metadata

``` python

def remove_metadata(
    content:str
)->str:

```

*Remove frontmatter from content*

## Content Checks

Functions for validating SEO-relevant content lengths for titles,
descriptions, and body text.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L124"
target="_blank" style="float:right; font-size:smaller">source</a>

### check_content_length

``` python

def check_content_length(
    content:str, # Page body content
)->dict:

```

*Check if page content has sufficient word count for SEO.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L107"
target="_blank" style="float:right; font-size:smaller">source</a>

### check_desc_length

``` python

def check_desc_length(
    description:str, # Meta description to check
    min_len:int=150, # Minimum optimal length
    max_len:int=160, # Maximum optimal length
)->dict:

```

*Check if a meta description length is optimal for SEO.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L92"
target="_blank" style="float:right; font-size:smaller">source</a>

### check_title_length

``` python

def check_title_length(
    title:str, # Page title to check
    min_len:int=50, # Minimum optimal length
    max_len:int=60, # Maximum optimal length
)->dict:

```

*Check if a page title length is optimal for SEO.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L83"
target="_blank" style="float:right; font-size:smaller">source</a>

### check_length

``` python

def check_length(
    text:str, # Text to check
    min_len:int, # Minimum optimal length
    max_len:int, # Maximum optimal length
)->dict:

```

*Check if text length falls within the optimal range.*

## Content Extraction

Functions for extracting headers, links, and images from markdown and
HTML content.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L132"
target="_blank" style="float:right; font-size:smaller">source</a>

### extract_headers

``` python

def extract_headers(
    file_path:str, # Path to markdown file
)->list:

```

*Extract all headers with their level, line number, content, and
length.*

``` python
headers = extract_headers(str(sample_dir / "example.md"))
test_eq(len([h for h in headers if h["type"] == "h1"]), 2)
test_eq(headers[0]["content"], "This is me Kareem")
```

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L150"
target="_blank" style="float:right; font-size:smaller">source</a>

### extract_links

``` python

def extract_links(
    content:str
)->dict:

```

*Extract all links with metadata*

``` python
links = extract_links(content)
test_eq("https://emdadelgaz.com" in links, True)
test_eq("https://awazly.com/" in links, True)
```

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L184"
target="_blank" style="float:right; font-size:smaller">source</a>

### extract_images

``` python

def extract_images(
    content:str, # Markdown or HTML content
)->list:

```

*Extract images with alt text from markdown and HTML.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L199"
target="_blank" style="float:right; font-size:smaller">source</a>

### imgs_missing_alts

``` python

def imgs_missing_alts(
    images:list, # List of image dicts from [`extract_images`](https://abdelkareemkobo.github.io/seo_rat/content_parser.html#extract_images)
)->list:

```

*Return URLs of images missing alt text.*

``` python
images = extract_images(content)
test_eq(len(images), 1)
test_eq(images[0]["alt_text"], "Iron man photo")
```

## Link Filtering

Utilities for classifying and filtering internal and external links.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L216"
target="_blank" style="float:right; font-size:smaller">source</a>

### filter_internal_links

``` python

def filter_internal_links(
    urls:list, # List of URLs to filter
    domain:str, # Domain to match against (e.g. 'example.com')
)->list:

```

*Filter for internal links, excluding images and special URLs.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L210"
target="_blank" style="float:right; font-size:smaller">source</a>

### is_special_url

``` python

def is_special_url(
    url:str, # URL to check
)->bool:

```

*Check if URL is an anchor, mailto, tel, or javascript link.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L227"
target="_blank" style="float:right; font-size:smaller">source</a>

### filter_external_links

``` python

def filter_external_links(
    urls:list, # List of URLs to filter
    domain:str, # Site domain to exclude
)->list:

```

*Filter for external links only, excluding images and special URLs.*

## Text Utilities

General-purpose text normalization and analysis helpers.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L237"
target="_blank" style="float:right; font-size:smaller">source</a>

### normalize_text

``` python

def normalize_text(
    text:str, # Text to normalize
)->str:

```

*Normalize text by collapsing extra whitespace.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L244"
target="_blank" style="float:right; font-size:smaller">source</a>

### detect_phone_numbers

``` python

def detect_phone_numbers(
    text:str, # Text to search
)->list:

```

*Extract phone numbers from text.*

``` python
phones = detect_phone_numbers(content)
test_eq("+966503139675" in phones, True)
```

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L252"
target="_blank" style="float:right; font-size:smaller">source</a>

### calculate_similarity

``` python

def calculate_similarity(
    text1:str, # First text
    text2:str, # Second text
)->float:

```

*Calculate similarity ratio between two texts using SequenceMatcher.*

## File Utilities

Helpers for finding and reading markdown and notebook files.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L261"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_file_paths

``` python

def get_file_paths(
    pattern:str, # Glob pattern (e.g. '**/*.md')
)->list:

```

*Get file paths matching a glob pattern.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L269"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_file_name

``` python

def get_file_name(
    file_path:str, # Path to file
)->str:

```

*Extract filename without extension from a path.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L284"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_page_content

``` python

def get_page_content(
    file_path:str, # Path to markdown or notebook file
    is_quarto:bool=False, # Whether the notebook is a Quarto doc
)->str:

```

*Read a file and return its text content, stripping frontmatter.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L276"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_markdown_files

``` python

def get_markdown_files(
    directory:str, # Directory to search
)->list:

```

*Get all markdown filenames (without extension) from a directory.*

## Arabic Slug Utilities

Helpers for converting Arabic filenames to URL-friendly slugs.

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L294"
target="_blank" style="float:right; font-size:smaller">source</a>

### arabic_to_slug

``` python

def arabic_to_slug(
    text:str, # Arabic text to convert
)->str:

```

*Convert Arabic text to a URL-friendly slug.*

------------------------------------------------------------------------

<a
href="https://github.com/abdelkareemkobo/seo_rat/blob/main/seo_rat/content_parser.py#L306"
target="_blank" style="float:right; font-size:smaller">source</a>

### map_files_to_slugs

``` python

def map_files_to_slugs(
    directory:str, # Directory containing Arabic markdown files
)->dict:

```

*Map markdown filenames to their URL slugs.*
