feat: Add robust HTML title extraction functionality
- Introduced `get_title_streaming(url)` function to fetch and extract page titles from URLs using common HTML conventions.
- Documented the function in DEVELOPER.md, including usage examples and requirements for BeautifulSoup.
- Enhanced the clarity and engagement of the documentation with lighthearted commentary.
- docs/DEVELOPER.md +34 -1
docs/DEVELOPER.md
CHANGED
@@ -395,4 +395,37 @@ User Query: "What are blend modes?"
 System: "Do you allow Internet search for query 'What are blend modes?'? Answer 'yes' will perform the search, any other answer will skip it."
 User: "no"
 System: [Skips web search, continues with local RAG only]
-```
+```
+
+## 🛠️ Robust HTML Title Extraction
+
+### `get_title_streaming(url)`
+
+This function fetches the HTML from a URL and extracts the page title using all the most common conventions, in this order:
+
+1. `<meta property="og:title" content="...">` (Open Graph, for social sharing)
+2. `<meta name="twitter:title" content="...">` (Twitter Cards)
+3. `<meta name="title" content="...">` (sometimes used for SEO)
+4. `<title>...</title>` (the classic HTML title tag)
+
+It returns the **first** found value as a string, or `None` if no title is found. All extraction is done with BeautifulSoup for maximum reliability and standards compliance.
+
+#### Example usage:
+```python
+from pstuts_rag.utils import get_title_streaming
+url = "https://example.com"
+title = get_title_streaming(url)
+print(title) # Prints the best available title, or None
+```
+
+---
+
+### 🥣 Requirements
+- This function requires `beautifulsoup4` to be installed:
+```bash
+pip install beautifulsoup4
+```
+
+---
+
+> "A page by any other name would still be as sweet... but it's nice to get the right one!" 😄
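
The commit shows only the documentation for `get_title_streaming(url)`, not its body. As a rough illustration of the documented lookup order (`og:title`, then `twitter:title`, then `meta name="title"`, then `<title>`), here is a minimal sketch. It is not the project's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies, it leaves out the URL fetch entirely, and the names `extract_title`, `META_PRIORITY`, and `_TitleCollector` are hypothetical. It also assumes that a `<meta>` tag with empty `content` counts as "not found".

```python
from html.parser import HTMLParser
from typing import Optional

# Priority order as documented in the diff: (attribute, value) pairs that
# identify each <meta> variant. These names are illustrative, not the project's.
META_PRIORITY = [
    ("property", "og:title"),
    ("name", "twitter:title"),
    ("name", "title"),
]


class _TitleCollector(HTMLParser):
    """Collects candidate titles from <meta> tags and the first <title>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.meta = {}            # (attribute, value) -> content string
        self.title_parts = []     # text chunks seen inside <title>
        self._in_title = False
        self._title_done = False  # only the first <title> is considered

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta":
            content = attr_map.get("content", "").strip()
            for key in META_PRIORITY:
                attr, wanted = key
                # Assumption: empty content is treated as "no title here".
                if content and attr_map.get(attr, "").lower() == wanted:
                    self.meta.setdefault(key, content)
        elif tag == "title" and not self._title_done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Return the first title found in the documented order, or None."""
    parser = _TitleCollector()
    parser.feed(html)
    parser.close()
    for key in META_PRIORITY:
        if key in parser.meta:
            return parser.meta[key]
    title = "".join(parser.title_parts).strip()
    return title or None
```

Keeping the parsing step separate from the network fetch makes the priority logic easy to exercise on plain HTML strings; the real function presumably combines both steps.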