Spaces: mbudisic/PsTuts-RAG

mbudisic committed on Jun 4

Commit 6a9e2f3

1 Parent(s): 110545b

feat: Add robust HTML title extraction functionality

- Introduced `get_title_streaming(url)` function to fetch and extract page titles from URLs using common HTML conventions.
- Documented the function in DEVELOPER.md, including usage examples and requirements for BeautifulSoup.
- Enhanced the clarity and engagement of the documentation with lighthearted commentary.

Files changed (1)

docs/DEVELOPER.md +34 -1

@@ -395,4 +395,37 @@ User Query: "What are blend modes?"
 System: "Do you allow Internet search for query 'What are blend modes?'? Answer 'yes' will perform the search, any other answer will skip it."
 User: "no"
 System: [Skips web search, continues with local RAG only]
-```
+```
+## 🛠️ Robust HTML Title Extraction
+### `get_title_streaming(url)`
+This function fetches the HTML at a URL and extracts the page title by checking the following common conventions, in this order:
+1. `<meta property="og:title" content="...">` (Open Graph, for social sharing)
+2. `<meta name="twitter:title" content="...">` (Twitter Cards)
+3. `<meta name="title" content="...">` (sometimes used for SEO)
+4. `<title>...</title>` (the classic HTML title tag)
+It returns the **first** value found as a string, or `None` if none of these sources is present. All extraction is done with BeautifulSoup, which tolerates imperfect markup.
+#### Example usage:
+```python
+from pstuts_rag.utils import get_title_streaming
+url = "https://example.com"
+title = get_title_streaming(url)
+print(title)  # Prints the best available title, or None
+```
+---
+### 🥣 Requirements
+- This function requires `beautifulsoup4` to be installed:
+  ```bash
+  pip install beautifulsoup4
+  ```
+---
+> "A page by any other name would still be as sweet... but it's nice to get the right one!" 😄
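
The lookup order described in the added documentation can be illustrated with a minimal sketch. It is not the code from this commit: the name `extract_title` is hypothetical, it works on an HTML string rather than fetching a URL, and it deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies. It only shows how "first match wins" across the four conventions might be implemented.

```python
from html.parser import HTMLParser
from typing import Optional

# Priority order from the documentation: (attribute name, attribute value).
META_PRIORITY = [
    ("property", "og:title"),   # Open Graph
    ("name", "twitter:title"),  # Twitter Cards
    ("name", "title"),          # SEO-style meta title
]


class _TitleCollector(HTMLParser):
    """Collects candidate titles from <meta> tags and the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}          # (attr, value) -> first non-empty content seen
        self.title_text = None  # text of the first non-empty <title>
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            content = (attr_map.get("content") or "").strip()
            for key in META_PRIORITY:
                attr, value = key
                if content and attr_map.get(attr) == value:
                    self.meta.setdefault(key, content)
        elif tag == "title" and self.title_text is None:
            self._in_title = True
            self._parts = []

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title_text = "".join(self._parts).strip() or None


def extract_title(html: str) -> Optional[str]:
    """Hypothetical helper: return the highest-priority title, or None."""
    parser = _TitleCollector()
    parser.feed(html)
    parser.close()
    # Meta tags are checked in priority order, regardless of document order.
    for key in META_PRIORITY:
        if key in parser.meta:
            return parser.meta[key]
    # Fall back to the classic <title> element.
    return parser.title_text
```

Because candidates are collected first and ranked afterwards, an `og:title` that appears late in `<head>` still takes precedence over an earlier `<title>`.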