Commit 0f8d78f by Karim Shoair · 1 parent: 0c327d2

docs: add the page for the `extract` command

Files changed (1): docs/cli/extract-commands.md (ADDED, +348 −0)
# Scrapling Extract Command Guide

**Web scraping through the terminal, no programming required!**

The `scrapling extract` command lets you download and extract content from websites directly from your terminal without writing any code. It's ideal for beginners, researchers, and anyone who needs quick web data extraction.

## What is the Extract command group?

The extract command group is a set of simple terminal tools that:

- **Downloads web pages** and saves their content to files.
- **Converts HTML to readable formats** such as Markdown, keeps it as HTML, or extracts only the page's text content.
- **Supports custom CSS selectors** to extract specific parts of the page.
- **Handles both plain HTTP requests and fetching through browsers.**
- **Is highly customizable** with custom headers, cookies, proxies, and more. Almost all options available through the code are also accessible through the command line.

## Quick Start

- **Basic Website Download**

Download a website's text content as clean, readable text:
```bash
scrapling extract get "https://example.com" page_content.txt
```
This performs an HTTP GET request and saves the page's text content to `page_content.txt`.

- **Save as Different Formats**

Choose your output format by changing the file extension:
```bash
# Convert the HTML content to Markdown before saving (great for documentation)
scrapling extract get "https://blog.example.com" article.md

# Save the HTML content as-is
scrapling extract get "https://example.com" page.html

# Save a clean version of the page's text content
scrapling extract get "https://example.com" content.txt
```

- **Extract Specific Content**

All commands accept CSS selectors through `--css-selector` or `-s` to extract specific parts of the page, as shown in the examples below.

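The extension-to-format rule above can be restated as a small shell dispatch. This is illustrative only; the CLI performs this mapping internally:

```shell
# Illustrative only: which output format each file extension selects.
for f in article.md page.html content.txt; do
  case "${f##*.}" in
    md)   echo "$f -> HTML converted to Markdown" ;;
    html) echo "$f -> raw HTML saved as-is" ;;
    txt)  echo "$f -> text content only" ;;
  esac
done
```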
## Available Commands

Display the available commands with `scrapling extract --help`:
```bash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

  Fetch web pages using various fetchers and extract full/selected HTML
  content as HTML, Markdown, or extract text content.

Options:
  --help  Show this message and exit.

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use DynamicFetcher to fetch content with browser...
  stealthy-fetch  Use StealthyFetcher to fetch content with advanced...
```

We will go through each command in detail below.

### HTTP Requests

1. **GET Request**

The most common command for downloading website content:

```bash
scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with a custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract only specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send a request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

# Add a user agent
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
```

List the available options with `scrapling extract get --help`:
```bash
Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE

  Perform a GET request and save the content to a file.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
  --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
  --timeout INTEGER        Request timeout in seconds (default: 30)
  --proxy TEXT             Proxy URL in format "http://username:password@host:port"
  -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
  -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
  --follow-redirects / --no-follow-redirects
                           Whether to follow redirects (default: True)
  --verify / --no-verify   Whether to verify SSL certificates (default: True)
  --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
  --stealthy-headers / --no-stealthy-headers
                           Use stealthy browser headers (default: True)
  --help                   Show this message and exit.
```
These options work the same way for all the other request commands, so they are not re-explained below.

2. **POST Request**

```bash
scrapling extract post [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Submit form data
scrapling extract post "https://api.site.com/search" results.html --data "query=python&type=tutorial"

# Send JSON data
scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}'
```
List the available options with `scrapling extract post --help`:
```bash
Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE

  Perform a POST request and save the content to a file.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  -d, --data TEXT          Form data to include in the request body (as string, ex: "param1=value1&param2=value2")
  -j, --json TEXT          JSON data to include in the request body (as string)
  -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
  --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
  --timeout INTEGER        Request timeout in seconds (default: 30)
  --proxy TEXT             Proxy URL in format "http://username:password@host:port"
  -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
  -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
  --follow-redirects / --no-follow-redirects
                           Whether to follow redirects (default: True)
  --verify / --no-verify   Whether to verify SSL certificates (default: True)
  --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
  --stealthy-headers / --no-stealthy-headers
                           Use stealthy browser headers (default: True)
  --help                   Show this message and exit.
```

3. **PUT Request**

```bash
scrapling extract put [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Send form data
scrapling extract put "https://scrapling.requestcatcher.com/put" results.html --data "update=info" --impersonate "firefox"

# Send JSON data
scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}'
```
List the available options with `scrapling extract put --help`:
```bash
Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE

  Perform a PUT request and save the content to a file.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  -d, --data TEXT          Form data to include in the request body
  -j, --json TEXT          JSON data to include in the request body (as string)
  -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
  --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
  --timeout INTEGER        Request timeout in seconds (default: 30)
  --proxy TEXT             Proxy URL in format "http://username:password@host:port"
  -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
  -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
  --follow-redirects / --no-follow-redirects
                           Whether to follow redirects (default: True)
  --verify / --no-verify   Whether to verify SSL certificates (default: True)
  --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
  --stealthy-headers / --no-stealthy-headers
                           Use stealthy browser headers (default: True)
  --help                   Show this message and exit.
```

4. **DELETE Request**

```bash
scrapling extract delete [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Basic DELETE request
scrapling extract delete "https://scrapling.requestcatcher.com/delete" results.html

# DELETE request while impersonating Chrome
scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome"
```
List the available options with `scrapling extract delete --help`:
```bash
Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE

  Perform a DELETE request and save the content to a file.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  -H, --headers TEXT       HTTP headers in format "Key: Value" (can be used multiple times)
  --cookies TEXT           Cookies string in format "name1=value1;name2=value2"
  --timeout INTEGER        Request timeout in seconds (default: 30)
  --proxy TEXT             Proxy URL in format "http://username:password@host:port"
  -s, --css-selector TEXT  CSS selector to extract specific content from the page. It returns all matches.
  -p, --params TEXT        Query parameters in format "key=value" (can be used multiple times)
  --follow-redirects / --no-follow-redirects
                           Whether to follow redirects (default: True)
  --verify / --no-verify   Whether to verify SSL certificates (default: True)
  --impersonate TEXT       Browser to impersonate (e.g., chrome, firefox).
  --stealthy-headers / --no-stealthy-headers
                           Use stealthy browser headers (default: True)
  --help                   Show this message and exit.
```

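The request commands above compose well with ordinary shell scripting. As a sketch (with hypothetical URLs), this loop prints one `get` invocation per URL, deriving each output file from the last path segment; review the output, then pipe it to `sh` to actually run the commands:

```shell
# Hypothetical URLs; prints the commands instead of running them,
# so you can inspect them first (pipe to `sh` to execute).
urls="https://example.com/intro https://example.com/guide"
for url in $urls; do
  out="$(basename "$url").md"
  echo "scrapling extract get $url $out --timeout 60"
done
```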
### Browser Fetching

1. **fetch - Handle Dynamic Content**

For websites that load content dynamically with JavaScript or have light protection:

```bash
scrapling extract fetch [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Wait for JavaScript to load content and network activity to finish
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle

# Wait for specific content to appear
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"

# Run in visible browser mode (helpful for debugging) and skip unnecessary resources
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
```
List the available options with `scrapling extract fetch --help`:
```bash
Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE

  Use DynamicFetcher to fetch content with browser automation.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  --headless / --no-headless  Run browser in headless mode (default: True)
  --disable-resources / --enable-resources
                              Drop unnecessary resources for speed boost (default: False)
  --network-idle / --no-network-idle
                              Wait for network idle (default: False)
  --timeout INTEGER           Timeout in milliseconds (default: 30000)
  --wait INTEGER              Additional wait time in milliseconds after page load (default: 0)
  -s, --css-selector TEXT     CSS selector to extract specific content from the page. It returns all matches.
  --wait-selector TEXT        CSS selector to wait for before proceeding
  --locale TEXT               Browser locale (default: en-US)
  --stealth / --no-stealth    Enable stealth mode (default: False)
  --hide-canvas / --show-canvas
                              Add noise to canvas operations (default: False)
  --disable-webgl / --enable-webgl
                              Disable WebGL support (default: False)
  --proxy TEXT                Proxy URL in format "http://username:password@host:port"
  -H, --extra-headers TEXT    Extra headers in format "Key: Value" (can be used multiple times)
  --help                      Show this message and exit.
```

2. **stealthy-fetch - Bypass Protection**

For websites behind anti-bot systems or Cloudflare protection:

```bash
scrapling extract stealthy-fetch [URL] [OUTPUT_FILE] [OPTIONS]
```

**Examples:**
```bash
# Bypass basic protection
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"

# Use a proxy for anonymity
scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
```
List the available options with `scrapling extract stealthy-fetch --help`:
```bash
Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE

  Use StealthyFetcher to fetch content with advanced stealth features.

  The output file path can be an HTML file, a Markdown file of the HTML
  content, or the text content itself. Use the file extensions
  (`.html`/`.md`/`.txt`) respectively.

Options:
  --headless / --no-headless  Run browser in headless mode (default: True)
  --block-images / --allow-images
                              Block image loading (default: False)
  --disable-resources / --enable-resources
                              Drop unnecessary resources for speed boost (default: False)
  --block-webrtc / --allow-webrtc
                              Block WebRTC entirely (default: False)
  --humanize / --no-humanize  Humanize cursor movement (default: False)
  --solve-cloudflare / --no-solve-cloudflare
                              Solve Cloudflare challenges (default: False)
  --allow-webgl / --block-webgl
                              Allow WebGL (default: True)
  --network-idle / --no-network-idle
                              Wait for network idle (default: False)
  --disable-ads / --allow-ads
                              Install uBlock Origin addon (default: False)
  --timeout INTEGER           Timeout in milliseconds (default: 30000)
  --wait INTEGER              Additional wait time in milliseconds after page load (default: 0)
  -s, --css-selector TEXT     CSS selector to extract specific content from the page. It returns all matches.
  --wait-selector TEXT        CSS selector to wait for before proceeding
  --geoip / --no-geoip        Use IP/Proxy geolocation for timezone/locale (default: False)
  --proxy TEXT                Proxy URL in format "http://username:password@host:port"
  -H, --extra-headers TEXT    Extra headers in format "Key: Value" (can be used multiple times)
  --help                      Show this message and exit.
```

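One easy mistake when moving between the two command families: the HTTP commands take `--timeout` in seconds, while `fetch` and `stealthy-fetch` take it in milliseconds, as the help outputs above show. A tiny sketch converting a single setting (the value 45 is arbitrary) for both:

```shell
# HTTP commands use seconds; browser commands use milliseconds.
timeout_s=45
timeout_ms=$((timeout_s * 1000))
echo "get/post/put/delete:  --timeout $timeout_s"
echo "fetch/stealthy-fetch: --timeout $timeout_ms"
```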
## When to Use Each Command

If you are not a web-scraping expert and can't decide which command to choose, use the following rules of thumb:

- Use **`get`** for simple websites, blogs, or news articles
- Use **`fetch`** for modern web apps or sites with dynamic content
- Use **`stealthy-fetch`** for protected sites, Cloudflare, or anti-bot systems

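These rules can even be scripted. As an illustrative sketch (hypothetical URL, with the protection level chosen by hand), this prints the matching invocation rather than running it:

```shell
# Hypothetical: pick the command family by the protection level you expect.
url="https://example.com"
protection="cloudflare"   # one of: none, dynamic, cloudflare
case "$protection" in
  none)       echo "scrapling extract get $url out.md" ;;
  dynamic)    echo "scrapling extract fetch $url out.md --network-idle" ;;
  cloudflare) echo "scrapling extract stealthy-fetch $url out.md --solve-cloudflare" ;;
esac
```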
## Legal and Ethical Considerations

⚠️ **Important Guidelines:**

- **Check robots.txt**: Visit `https://website.com/robots.txt` to see the site's scraping rules
- **Respect rate limits**: Don't overwhelm servers with requests
- **Terms of Service**: Read and comply with website terms
- **Copyright**: Respect intellectual property rights
- **Privacy**: Be mindful of personal data protection laws
- **Commercial use**: Ensure you have permission for business purposes

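Once you have a site's robots.txt body, you can sanity-check a path before extracting it. This is a naive prefix check against `Disallow` rules only (hypothetical rules shown); real robots.txt matching also involves `Allow` rules, wildcards, and per-agent groups, so use a proper parser for anything serious:

```shell
# Naive sketch: flag a path whose prefix matches any Disallow rule.
# Hypothetical robots.txt body; not a full Robots Exclusion Protocol parser.
robots='User-agent: *
Disallow: /private/
Disallow: /admin/'
path="/private/data"
verdict=allowed
while read -r key value; do
  case "$key" in
    [Dd]isallow:)
      case "$path" in "$value"*) verdict=blocked ;; esac ;;
  esac
done <<EOF
$robots
EOF
echo "$verdict: $path"
```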
---

*Happy scraping! Remember to always respect website policies and comply with all applicable legal requirements.*