Karim shoair committed on
Commit ·
5ba380b
1
Parent(s): eca1c09
docs: update the migrating from BS article
docs/tutorials/migrating_from_beautifulsoup.md
CHANGED
@@ -1,10 +1,10 @@
 # Migrating from BeautifulSoup to Scrapling

-If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is
+If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is much faster, provides the same parsing capabilities as BS, adds additional parsing capabilities not found in BS, and introduces powerful new features for fetching and handling modern web pages. This guide will help you quickly adapt your existing BeautifulSoup code to leverage Scrapling's capabilities.

-Below is a table that covers the most common operations you'll perform when scraping web pages. Each row illustrates how to
+Below is a table that covers the most common operations you'll perform when scraping web pages. Each row illustrates how to achieve a specific task using BeautifulSoup and the corresponding method in Scrapling.

-You will notice that some shortcuts in BeautifulSoup are missing in Scrapling,
+You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, which is one of the reasons BeautifulSoup is slower than Scrapling. The point is: If the same feature can be used in a short one-liner, there is no need to sacrifice performance to shorten that short line :)


 | Task | BeautifulSoup Code | Scrapling Code |
@@ -56,7 +56,7 @@ Here's a simple example of scraping a web page to extract all the links using Be
 import requests
 from bs4 import BeautifulSoup

-url = '
+url = 'https://example.com'
 response = requests.get(url)
 soup = BeautifulSoup(response.text, 'html.parser')

@@ -70,7 +70,7 @@ for link in links:
 ```python
 from scrapling import Fetcher

-url = '
+url = 'https://example.com'
 page = Fetcher.get(url)

 links = page.css('a::attr(href)')
@@ -78,12 +78,12 @@ for link in links:
     print(link)
 ```

-As you can see, Scrapling simplifies the process by
+As you can see, Scrapling simplifies the process by combining fetching and parsing into a single step, making your code cleaner and more efficient.

 **Additional Notes:**

 - **Different parsers**: BeautifulSoup allows you to set the parser engine to use, and one of them is `lxml`. Scrapling doesn't do that and uses the `lxml` library by default for performance reasons.
-- **Element Types**: In BeautifulSoup, elements are `Tag` objects
+- **Element Types**: In BeautifulSoup, elements are `Tag` objects; in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
 - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.css_first()`). To avoid errors, check for `None` before accessing properties.
 - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.
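The **Text Extraction** note in the diff mentions that `clean()` can remove extra whitespace and consecutive spaces. To make that kind of normalization concrete, here is a rough standard-library sketch. It is not Scrapling's implementation and may not match what `clean()` does exactly; the function name `normalize_whitespace` is hypothetical and exists only for this illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends.

    Illustrative only: this approximates the cleanup described in the note
    and is an assumption, not Scrapling's actual code.
    """
    # \s+ matches any run of spaces, tabs, or newlines; each run becomes one space.
    return re.sub(r"\s+", " ", text).strip()


raw = "  Hello,\n\t  world!   "
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

When migrating BeautifulSoup code that does this kind of cleanup by hand, it may be worth checking whether the `TextHandler` methods in the Scrapling documentation already cover it.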