mbudisic committed on
Commit
6a9e2f3
·
1 Parent(s): 110545b

feat: Add robust HTML title extraction functionality

- Introduced `get_title_streaming(url)` function to fetch and extract page titles from URLs using common HTML conventions.
- Documented the function in DEVELOPER.md, including usage examples and requirements for BeautifulSoup.
- Enhanced the clarity and engagement of the documentation with lighthearted commentary.

Files changed (1)
  1. docs/DEVELOPER.md +34 -1
docs/DEVELOPER.md CHANGED
@@ -395,4 +395,37 @@ User Query: "What are blend modes?"
395
  System: "Do you allow Internet search for query 'What are blend modes?'? Answer 'yes' will perform the search, any other answer will skip it."
396
  User: "no"
397
  System: [Skips web search, continues with local RAG only]
398
- ```
398
+ ```
399
+
400
+ ## 🛠️ Robust HTML Title Extraction
401
+
402
+ ### `get_title_streaming(url)`
403
+
404
+ This function fetches the HTML from a URL and extracts the page title, trying the most common conventions in this order:
405
+
406
+ 1. `<meta property="og:title" content="...">` (Open Graph, for social sharing)
407
+ 2. `<meta name="twitter:title" content="...">` (Twitter Cards)
408
+ 3. `<meta name="title" content="...">` (sometimes used for SEO)
409
+ 4. `<title>...</title>` (the classic HTML title tag)
410
+
411
+ It returns the **first** value found as a string, or `None` if no title is found. All extraction is done with BeautifulSoup, which tolerates the malformed markup common on real-world pages.
412
+
413
+ #### Example usage:
414
+ ```python
415
+ from pstuts_rag.utils import get_title_streaming
416
+ url = "https://example.com"
417
+ title = get_title_streaming(url)
418
+ print(title)  # Prints the best available title, or None
419
+ ```
420
+
421
+ ---
422
+
423
+ ### 🥣 Requirements
424
+ - This function requires `beautifulsoup4` to be installed:
425
+ ```bash
426
+ pip install beautifulsoup4
427
+ ```
428
+
429
+ ---
430
+
431
+ > "A page by any other name would still be as sweet... but it's nice to get the right one!" 😄
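
The lookup order described in the DEVELOPER.md addition can be sketched as below. This is not the implementation of `get_title_streaming`, which the commit does not show. It is a minimal illustration under stated assumptions: it works on HTML that has already been fetched, and it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies. The names `extract_title`, `_TitleCollector`, and `META_PRIORITY` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional

# Priority order taken from the documentation text:
# (attribute name, attribute value) pairs to look for on <meta> tags.
META_PRIORITY = [
    ("property", "og:title"),
    ("name", "twitter:title"),
    ("name", "title"),
]


class _TitleCollector(HTMLParser):
    """Hypothetical helper: gathers title candidates while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}           # (attr, value) -> first non-empty content seen
        self.title_parts = []    # text chunks inside the first <title> element
        self._in_title = False
        self._title_done = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            content = (attr_map.get("content") or "").strip()
            for key in META_PRIORITY:
                attr, value = key
                if content and attr_map.get(attr) == value:
                    self.meta.setdefault(key, content)
        elif tag == "title" and not self._title_done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Return the first title found in the documented order, or None."""
    parser = _TitleCollector()
    parser.feed(html)
    parser.close()
    for key in META_PRIORITY:
        if key in parser.meta:
            return parser.meta[key]
    text = "".join(parser.title_parts).strip()
    return text or None


page = (
    "<html><head><title>Plain</title>"
    '<meta name="twitter:title" content="Tw">'
    '<meta property="og:title" content="OG">'
    "</head></html>"
)
print(extract_title(page))
```

Because the candidates are collected first and ranked afterwards, the Open Graph title is chosen even though it appears last in the markup. A caller should still handle the `None` case, for example by falling back to the URL itself.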