Commit 96fcaa1

Converted to composer
1 parent f964064 commit 96fcaa1

12 files changed: +296 -214 lines changed

.gitattributes

Lines changed: 0 additions & 2 deletions
This file was deleted.

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+composer.phar
+composer.lock
+/vendor/
+
+# Node artifact files
+**/node_modules/
+
+# Generated by MacOS
+.DS_Store
+
+# Generated by Windows
+Thumbs.db

DOC.md

Lines changed: 12 additions & 12 deletions
@@ -1,16 +1,16 @@
-# Class HTML_Scraper
+# Class HTMLScraper
 ### Static Functions:
 -`new_from($source)`

-Create a new HTML_Scraper object from the passed source.
+Create a new HTMLScraper object from the passed source.
 `$source` can be of type `DOMNodeList`, `DOMNode` or `string`.

 **Returns:**

 | Type | Description |
 |---|---|
-| `array` | When `$source` is an instance of `DOMNodeList` then returns an `array` of `HTML_Scraper` objects. |
-| `HTML_Scraper` | When `$source` is an instance of `DOMNode` or a `string` |
+| `array` | When `$source` is an instance of `DOMNodeList` then returns an `array` of `HTMLScraper` objects. |
+| `HTMLScraper` | When `$source` is an instance of `DOMNode` or a `string` |


 -`CSS_to_Xpath(string $path) : string`
@@ -20,7 +20,7 @@
 ### Functions:
 -`__toString() : string`

-Magic function to convert `HTML_Scraper` into a `string` containing the HTML code of the loaded document.
+Magic function to convert `HTMLScraper` into a `string` containing the HTML code of the loaded document.


 -`textContent() : string`
@@ -43,7 +43,7 @@
 Load HTML from a file.

 - `$options`
-*see `$options` in `HTML_Scraper->load_HTML_str()`*
+*see `$options` in `HTMLScraper->load_HTML_str()`*

 - `$context`
 *see `$context` in `stream_context_create()`*
@@ -74,11 +74,11 @@

 -`querySelector(string $selector, int ...$items)`

-Same as `HTML_Scraper->xpath()` except that it uses CSS selector instead of *XPath* path expression.
+Same as `HTMLScraper->xpath()` except that it uses CSS selector instead of *XPath* path expression.

 -`xpath_extract($mapper, string $expr, int ...$items)`

-Find `DOMNode`(s) in the same way as in `HTML_Scraper->xpath()` then extract data from the `DOMNode`(s) as specified by the `$mapper`.
+Find `DOMNode`(s) in the same way as in `HTMLScraper->xpath()` then extract data from the `DOMNode`(s) as specified by the `$mapper`.

 - `$mapper`
 It can be any one of the `string` specified below or a `function` that takes a `DOMNode` and returns any extracted value.
@@ -91,7 +91,7 @@

 -`querySelector_extract($mapper, string $selector, int ...$items)`

-Same as `HTML_Scraper->xpath_extract()` except that it uses CSS selector instead of *XPath* path expression.
+Same as `HTMLScraper->xpath_extract()` except that it uses CSS selector instead of *XPath* path expression.

 ---

@@ -111,7 +111,7 @@

 -`xpath(DOMNode &$node, string $expr, int ...$items)`

-Similar to `HTML_Scraper->xpath()` except that it works on a `DOMNode` instead of the `HTML_Scraper`'s `DOMDocument`.
+Similar to `HTMLScraper->xpath()` except that it works on a `DOMNode` instead of the `HTMLScraper`'s `DOMDocument`.

 -`querySelector(DOMNode &$node, string $selector, int ...$items)`

@@ -122,7 +122,7 @@
 Get one or more child nodes of the `DOMNode`.

 - `$indexes`
-*See `$items` in `HTML_Scraper->expath()`.*
+*See `$items` in `HTMLScraper->xpath()`.*

 **Returns:**

@@ -154,4 +154,4 @@
 Removes the child elements of the passed `DOMNode` specified by the `...$indexes`.

 - `$indexes`
-*See `$items` in `HTML_Scraper->expath()`.*
+*See `$items` in `HTMLScraper->xpath()`.*
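
Note on the DOC.md hunks: the documented `xpath_extract($mapper, string $expr, int ...$items)` flow (find nodes by XPath, then map each node to a value) can be approximated with PHP's bundled DOM extension alone. The sketch below is not the library's implementation. `sketch_xpath_extract` is a hypothetical helper, and its handling of `$items` (no index returns all values, one index returns a single value) is an assumption inferred from how the README example calls the method.

```php
<?php
// Hypothetical helper, NOT part of HTMLScraper: imitates the documented
// "query by XPath, then apply $mapper to each node" flow using only ext-dom.
function sketch_xpath_extract(DOMDocument $doc, callable $mapper, string $expr, int ...$items) {
    $nodes = (new DOMXPath($doc))->query($expr);
    if ($nodes === false) {
        return null; // malformed XPath expression
    }
    $values = [];
    foreach ($nodes as $node) {
        $values[] = $mapper($node);
    }
    // Assumed semantics for $items: none = all values, one = single value,
    // several = the values at those positions (null where out of range).
    if (count($items) === 0) {
        return $values;
    }
    if (count($items) === 1) {
        return $values[$items[0]] ?? null;
    }
    return array_map(fn(int $i) => $values[$i] ?? null, $items);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><head><meta property="og:url" content="https://example.com/a"></head>'
    . '<body><h1> Title </h1></body></html>');

$url = sketch_xpath_extract($doc, fn(DOMElement $m) => $m->getAttribute('content'), '//meta[@property="og:url"]', 0);
$title = sketch_xpath_extract($doc, fn(DOMNode $n) => trim($n->textContent), '//h1', 0);
```

Here `$url` holds the `content` attribute of the matched `<meta>` element and `$title` holds the trimmed heading text.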

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2018 Anshu Krishna
+Copyright (c) 2021 Anshu Krishna

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 11 additions & 7 deletions
@@ -18,9 +18,14 @@ For *basic* documentation see the DOC file.
 ### Example
 ```php
 <?php
-require_once 'HTML_Scraper.php';
+require_once 'vendor/autoload.php';

-$doc = new HTML_Scraper;
+use Krishna\DOMNodeHelper;
+use Krishna\HTMLScraper;
+
+const TrimmedText = HTMLScraper::Extract_textContentTrim;
+
+$doc = new HTMLScraper();

 if(!$doc->load_HTML_file('https://www.royalroad.com/fiction/10073/the-wandering-inn')) {
     echo 'Unable to load data';
@@ -29,18 +34,17 @@ if(!$doc->load_HTML_file('https://www.royalroad.com/fiction/10073/the-wandering-

 $data = [];

-$data['title'] = $doc->querySelector_extract('textContentTrim', 'div.fic-title h1[property="name"]', 0);
+$data['title'] = $doc->querySelector_extract(TrimmedText, 'div.fic-title h1[property="name"]', 0);

 $data['url'] = $doc->xpath_extract(function($meta) {
     return $meta->getAttribute('content');
 }, '//meta[@property="og:url"]', 0);

-$data['description'] = $doc->querySelector_extract(function(&$div) {
+$data['description'] = htmlspecialchars($doc->querySelector_extract(function(&$div) {
     return trim(DOMNodeHelper::innerHTML($div));
-}, 'div.description div[property="description"]', 0);
+}, 'div.description div[property="description"]', 0));

-$data['tags'] = $doc->querySelector_extract('textContentTrim', 'span.tags span[property="genre"]');
+$data['tags'] = $doc->querySelector_extract(TrimmedText, 'span.tags span[property="genre"]');

 var_dump($data);
-?>
 ```
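
Note on the `htmlspecialchars()` change in this hunk: the description mapper returns inner HTML, so the value still contains markup. Escaping it presumably keeps that markup visible as text when the result is viewed in a browser. The sketch below shows the effect with plain DOM calls. `sketch_inner_html` is a hypothetical stand-in, and it is only an assumption that `DOMNodeHelper::innerHTML` behaves similarly.

```php
<?php
// Hypothetical stand-in for an innerHTML helper: serialises each child node
// of $node with DOMDocument::saveHTML(). Not the library's own code.
function sketch_inner_html(DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello <b>world</b></p></div>');
$div = $doc->getElementsByTagName('div')->item(0);

$raw = trim(sketch_inner_html($div)); // still contains tags such as <b>
$escaped = htmlspecialchars($raw);    // tags become &lt;b&gt; and so on
```

After escaping, `$escaped` contains no literal `<` characters, so a browser prints the tags instead of rendering them.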

composer.json

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+{
+    "name": "anshu-krishna/html-scraper",
+    "description": "A set of PHP classes to simplify data extraction from HTML.",
+    "type": "library",
+    "license": "MIT",
+    "authors": [
+        {
+            "name": "Anshu Krishna",
+            "email": "anshu.krishna5@gmail.com"
+        }
+    ],
+    "version": "3.5.0",
+    "require": {
+        "php": ">=8.0.0"
+    },
+    "autoload": {
+        "psr-4": {
+            "Krishna\\": "src"
+        }
+    }
+}
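
Note on the `autoload` block: the PSR-4 entry `"Krishna\\": "src"` means a class under the `Krishna\` namespace is looked up in a matching file below `src`. The sketch below illustrates only that name-to-path rule. `sketch_psr4_path` is a hypothetical function; Composer's real autoloader is generated into `vendor/` and handles far more cases.

```php
<?php
// Hypothetical illustration of PSR-4 resolution for the mapping in this
// composer.json. Returns null when the class is outside the prefix.
function sketch_psr4_path(string $class, string $prefix = 'Krishna\\', string $baseDir = 'src'): ?string {
    if (!str_starts_with($class, $prefix)) {
        return null;
    }
    // Strip the namespace prefix, then turn remaining separators into slashes.
    $relative = substr($class, strlen($prefix));
    return $baseDir . '/' . str_replace('\\', '/', $relative) . '.php';
}

$scraperPath = sketch_psr4_path('Krishna\\HTMLScraper');   // 'src/HTMLScraper.php'
$helperPath  = sketch_psr4_path('Krishna\\DOMNodeHelper'); // 'src/DOMNodeHelper.php'
```

This is why the examples in this commit can replace their `require_once` of a specific class file with `vendor/autoload.php` plus `use` statements.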

example/.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+composer.phar
+composer.lock
+/vendor/
+
+# Node artifact files
+**/node_modules/
+
+# Generated by MacOS
+.DS_Store
+
+# Generated by Windows
+Thumbs.db

example/composer.json

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+{
+    "name": "anshu-krishna/scraper-example",
+    "description": "Examples for HTMLScraper",
+    "type": "project",
+    "license": "MIT",
+    "authors": [
+        {
+            "name": "Anshu Krishna",
+            "email": "anshu.krishna5@gmail.com"
+        }
+    ],
+    "version": "1.0.0",
+    "repositories": [{
+        "type": "path",
+        "url": "..",
+        "options": {
+            "symlink": true
+        }
+    }],
+    "require": {
+        "php": ">=8.0.0",
+        "anshu-krishna/html-scraper" : "*"
+    }
+}

example/example.php

Lines changed: 17 additions & 16 deletions
@@ -1,15 +1,18 @@
+<style>
+    pre {
+        white-space: pre-wrap;
+    }
+</style>
+<pre>
 <?php
-// function echo_html(string $html_source) {
-// echo '<pre>', htmlentities($html_source), '</pre>';
-// }
+require_once 'vendor/autoload.php';

-// function echo_DOMNode(DOMNode $node) {
-// echo_html(DOMNodeHelper::outerHTML($node));
-// }
+use Krishna\DOMNodeHelper;
+use Krishna\HTMLScraper;

-require_once '../HTML_Scraper.php';
+const TrimmedText = HTMLScraper::Extract_textContentTrim;

-$doc = new HTML_Scraper;
+$doc = new HTMLScraper();

 if(!$doc->load_HTML_file('sample_data_file.html')) {
     echo 'Unable to load data';
@@ -26,13 +29,13 @@

 $data = [];

-$data['title'] = $doc->querySelector_extract('textContentTrim', 'div.fic-title h1[property="name"]', 0);
+$data['title'] = $doc->querySelector_extract(TrimmedText, 'div.fic-title h1[property="name"]', 0);

 $data['url'] = $doc->xpath_extract(function($meta) {
     return $meta->getAttribute('content');
 }, '//meta[@property="og:url"]', 0);

-$data['auth'] = $doc->querySelector_extract('textContentTrim', 'div.fic-title h4[property="author"] span[property="name"]', 0);
+$data['auth'] = $doc->querySelector_extract(TrimmedText, 'div.fic-title h4[property="author"] span[property="name"]', 0);

 $data['auth_link'] = $doc->querySelector_extract(function(&$a) {
     return 'https://www.royalroad.com' . $a->getAttribute('href');
@@ -56,11 +59,11 @@
     return 275 * $pages;
 }, 'li[property="numberOfPages"]', 0);

-$data['desc'] = $doc->querySelector_extract(function(&$div) {
+$data['desc'] = htmlspecialchars($doc->querySelector_extract(function(&$div) {
     return trim(DOMNodeHelper::innerHTML($div));
-}, 'div.description div[property="description"]', 0);
+}, 'div.description div[property="description"]', 0));

-$data['tags'] = $doc->querySelector_extract('textContentTrim', 'span.tags span[property="genre"]');
+$data['tags'] = $doc->querySelector_extract(TrimmedText, 'span.tags span[property="genre"]');

 $replace = NULL;
 if($data['url'] !== NULL && preg_match("/http[s]?:\/\/www\.royalroad\.com\/(.+)\/?/", $data['url'], $mtc)) {
@@ -85,6 +88,4 @@
 if(is_array($data['ch_links'])) {
     $data['chaps'] = count($data['ch_links']);
 }
-
-var_dump($data);
-?>
+echo json_encode($data, JSON_PRETTY_PRINT | JSON_INVALID_UTF8_SUBSTITUTE | JSON_PARTIAL_OUTPUT_ON_ERROR);
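
Note on the new `json_encode()` call: scraped text is not guaranteed to be valid UTF-8, and without extra flags `json_encode()` returns `false` for such input. The sketch below shows the difference the flags make. The sample array is invented for illustration and is not taken from the example's data file.

```php
<?php
// Invented sample: "\xE9" is a Latin-1 byte and not valid UTF-8 on its own.
$data = ['title' => "Caf\xE9", 'chaps' => 3];

// Without flags, encoding fails outright and returns false.
$strict = json_encode($data);

// With the flags used in example.php, invalid bytes are substituted with
// U+FFFD and the rest of the structure is still emitted.
$lenient = json_encode(
    $data,
    JSON_PRETTY_PRINT | JSON_INVALID_UTF8_SUBSTITUTE | JSON_PARTIAL_OUTPUT_ON_ERROR
);

$decoded = json_decode($lenient, true);
```

With the lenient flags, `$decoded['chaps']` survives intact and the bad byte in the title is replaced by the Unicode replacement character.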

example/example_css_to_xpath.php

Lines changed: 5 additions & 2 deletions
@@ -25,7 +25,10 @@
 <header>CSS</header>
 <header>XPath</header>
 <?php
-require_once '../html_scraper.php';
+require_once 'vendor/autoload.php';
+
+use Krishna\HTMLScraper;
+
 $examples = [
     'div',
     'div.abc',
@@ -47,7 +50,7 @@
 $examples = array_map(function($selector) {
     return implode(PHP_EOL, array_map(function($str) {
         return "<span>" . htmlspecialchars($str) . "</span>";
-    }, [$selector, HTML_Scraper::CSS_to_Xpath($selector)]));
+    }, [$selector, HTMLScraper::CSS_to_Xpath($selector)]));
 }, $examples);

 echo implode(PHP_EOL, $examples);
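
Note on `CSS_to_Xpath`: this page prints each CSS selector next to its XPath translation. As a rough idea of what such a translation involves, the sketch below converts only the simplest selectors (`tag`, `#id`, `.class` and combinations of them). `sketch_css_to_xpath` is a hypothetical function. It does not reproduce the output of `HTMLScraper::CSS_to_Xpath`, which may differ in form and supports more syntax.

```php
<?php
// Hypothetical, deliberately tiny CSS-to-XPath converter. Supports only
// an optional tag, an optional #id and any number of .class parts.
function sketch_css_to_xpath(string $selector): ?string {
    if ($selector === '' || !preg_match('/^([a-z][a-z0-9]*)?(?:#([\w-]+))?((?:\.[\w-]+)*)$/i', $selector, $m)) {
        return null; // anything else (combinators, attributes) is unsupported
    }
    $xpath = '//' . ($m[1] !== '' ? $m[1] : '*');
    if ($m[2] !== '') {
        $xpath .= "[@id='{$m[2]}']";
    }
    $classes = array_filter(explode('.', $m[3]), fn(string $c) => $c !== '');
    foreach ($classes as $class) {
        // Match whole class tokens, so "abc" does not match class="abcd".
        $xpath .= "[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]";
    }
    return $xpath;
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="abc def">x</div><div class="abcd">y</div>');
$hits = (new DOMXPath($doc))->query(sketch_css_to_xpath('div.abc'));
```

Only the first `<div>` is matched here, because the class test compares whole space-separated tokens.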
