Webbots, Spiders, and Screen Scrapers
by Michael Schrenk

Table of Contents
ACKNOWLEDGMENTS
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
Learn from My Mistakes
Master Webbot Techniques
Leverage Existing Scripts
About the Website
About the Code
Requirements
Hardware
Software
Internet Access
A Disclaimer (This Is Important)
I. FUNDAMENTAL CONCEPTS AND TECHNIQUES
1. WHAT'S IN IT FOR YOU?
Uncovering the Internet's True Potential
What's in It for Developers?
Webbot Developers Are in Demand
Webbots Are Fun to Write
Webbots Facilitate "Constructive Hacking"
What's in It for Business Leaders?
Customize the Internet for Your Business
Capitalize on the Public's Inexperience with Webbots
Accomplish a Lot with a Small Investment
Final Thoughts
2. IDEAS FOR WEBBOT PROJECTS
Inspiration from Browser Limitations
Webbots That Aggregate and Filter Information for Relevance
Webbots That Interpret What They Find Online
Webbots That Act on Your Behalf
A Few Crazy Ideas to Get You Started
Help Out a Busy Executive
Save Money by Automating Tasks
Protect Intellectual Property
Monitor Opportunities
Verify Access Rights on a Website
Create an Online Clipping Service
Plot Unauthorized Wi-Fi Networks
Track Web Technologies
Allow Incompatible Systems to Communicate
Final Thoughts
3. DOWNLOADING WEB PAGES
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Downloading Files with fopen() and fgets()
Creating Your First Webbot Script
Executing Webbots in Command Shells
Executing Webbots in Browsers
Downloading Files with file()
Introducing PHP/CURL
Multiple Transfer Protocols
Form Submission
Basic Authentication
Cookies
Redirection
Agent Name Spoofing
Referer Management
Socket Management
Installing PHP/CURL
LIB_http
Familiarizing Yourself with the Default Values
Using LIB_http
http_get()
http_get_withheader()
Learning More About HTTP Headers
Examining LIB_http's Source Code
LIB_http Defaults
LIB_http Functions
Final Thoughts
4. PARSING TECHNIQUES
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Splitting a String at a Delimiter: split_string()
Parsing Text Between Delimiters: return_between()
Parsing a Data Set into an Array: parse_array()
Parsing Attribute Values: get_attribute()
Removing Unwanted Text: remove()
Useful PHP Functions
Detecting Whether a String Is Within Another String
Replacing a Portion of a String with Another String
Parsing Unformatted Text
Measuring the Similarity of Strings
Final Thoughts
Don't Trust a Poorly Coded Web Page
Parse in Small Steps
Don't Render Parsed Text While Debugging
Use Regular Expressions Sparingly
5. AUTOMATING FORM SUBMISSION
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Form Handlers
Data Fields
Methods
The GET Method
The POST Method
Event Triggers
Unpredictable Forms
JavaScript Can Change a Form Just Before Submission
Form HTML Is Often Unreadable by Humans
Cookies Aren't Included in the Form, but Can Affect Operation
Analyzing a Form
Final Thoughts
Don't Blow Your Cover
Correctly Emulate Browsers
Avoid Form Errors
6. MANAGING LARGE AMOUNTS OF DATA
Organizing Data
Naming Conventions
Storing Data in Structured Files
Storing Text in a Database
LIB_mysql
The insert() Function
The update() Function
The exe_sql() Function
Storing Images in a Database
Database or File?
Making Data Smaller
Storing References to Image Files
Compressing Data
Compressing Inbound Files
Compressing Files on Your Hard Drive
Removing Formatting
Thumbnailing Images
Final Thoughts
II. PROJECTS
7. PRICE-MONITORING WEBBOTS
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
8. IMAGE-CAPTURING WEBBOTS
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Binary-Safe Download Routine
Directory Structure
The Main Script
Initialization and Target Validation
Defining the Page Base
Creating a Root Directory for Imported File Structure
Parsing Image Tags from the Downloaded Web Page
The Image-Processing Loop
Creating the Local Directory Structure
Downloading and Saving the File
Further Exploration
Final Thoughts
9. LINK-VERIFICATION WEBBOTS
Creating the Link-Verification Webbot
Initializing the Webbot and Downloading the Target
Setting the Page Base
Parsing the Links
Running a Verification Loop
Generating Fully Resolved URLs
Downloading the Linked Page
Displaying the Page Status
Running the Webbot
LIB_http_codes
LIB_resolve_addresses
Further Exploration
10. ANONYMOUS BROWSING WEBBOTS
Anonymity with Proxies
Non-proxied Environments
Your Online Exposure
Proxied Environments
The Anonymizer Project
Writing the Anonymizer
Downloading and Preparing the Target Web Page
Modifying the <base> Tag
Parsing the Links
Substituting the Links
Displaying the Proxied Web Page
Final Thoughts
11. SEARCH-RANKING WEBBOTS
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Initializing Variables
Starting the Loop
Fetching the Search Results
Parsing the Search Results
Final Thoughts
Be Kind to Your Sources
Search Sites May Treat Webbots Differently Than Browsers
Spidering Search Engines Is a Bad Idea
Familiarize Yourself with the Google API
Further Exploration
12. AGGREGATION WEBBOTS
Choosing Data Sources for Webbots
Example Aggregation Webbot
Familiarizing Yourself with RSS Feeds
Writing the Aggregation Webbot
Downloading and Parsing the Target
Dealing with CDATA
Adding Filtering to Your Aggregation Webbot
Further Exploration
13. FTP WEBBOTS
Example FTP Webbot
PHP and FTP
Further Exploration
14. NNTP NEWS WEBBOTS
NNTP Use and History
Webbots and Newsgroups
Identifying News Servers
Identifying Newsgroups
Finding Articles in Newsgroups
Reading an Article from a Newsgroup
Further Exploration
15. WEBBOTS THAT READ EMAIL
The POP3 Protocol
Logging into a POP3 Mail Server
Reading Mail from a POP3 Mail Server
The POP3 LIST Command
The POP3 RETR Command
Other Useful POP3 Commands
Executing POP3 Commands with a Webbot
Further Exploration
Email-Controlled Webbots
Email Interfaces
16. WEBBOTS THAT SEND EMAIL
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Configuring PHP to Send Mail
Sending an Email with mail()
Writing a Webbot That Sends Email Notifications
Keeping Legitimate Mail out of Spam Filters
Sending HTML-Formatted Email
Further Exploration
Using Returned Emails to Prune Access Lists
Using Email as Notification That Your Webbot Ran
Leveraging Wireless Technologies
Writing Webbots That Send Text Messages
17. CONVERTING A WEBSITE INTO A FUNCTION
Writing a Function Interface
Defining the Interface
Analyzing the Target Web Page
Using describe_zipcode()
Getting the Session Value
Submitting the Form
Parsing and Returning the Result
Final Thoughts
Distributing Resources
Using Standard Interfaces
Designing a Custom Lightweight "Web Service"
III. ADVANCED TECHNICAL CONSIDERATIONS
18. SPIDERS
How Spiders Work
Example Spider
LIB_simple_spider
harvest_links()
archive_links()
get_domain()
exclude_link()
Experimenting with the Spider
Adding the Payload
Further Exploration
Save Links in a Database
Separate the Harvest and Payload
Distribute Tasks Across Multiple Computers
Regulate Page Requests
19. PROCUREMENT WEBBOTS AND SNIPERS
Procurement Webbot Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Evaluate Purchase Triggers
Make Purchase
Evaluate Results
Sniper Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Synchronize Clocks
Time to Bid?
Submit Bid
Evaluate Results
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
20. WEBBOTS AND CRYPTOGRAPHY
Designing Webbots That Use Encryption
SSL and PHP Built-in Functions
Encryption and PHP/CURL
A Quick Overview of Web Encryption
Local Certificates
Final Thoughts
21. AUTHENTICATION
What Is Authentication?
Types of Online Authentication
Strengthening Authentication by Combining Techniques
Authentication and Webbots
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Authentication with Cookie Sessions
How Cookies Work
Cookie Session Example
Authentication with Query Sessions
Final Thoughts
22. ADVANCED COOKIE MANAGEMENT
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Purging Temporary Cookies
Managing Multiple Users' Cookies
Further Exploration
23. SCHEDULING WEBBOTS AND SPIDERS
The Windows Task Scheduler
Preparing Your Webbots to Run as Scheduled Tasks
Scheduling a Webbot to Run Daily
Complex Schedules
Non-Calendar-Based Triggers
Final Thoughts
Determine the Webbot's Best Periodicity
Avoid Single Points of Failure
Add Variety to Your Schedule
IV. LARGER CONSIDERATIONS
24. DESIGNING STEALTHY WEBBOTS AND SPIDERS
Why Design a Stealthy Webbot?
Log Files
Access Logs
Error Logs
Custom Logs
Log-Monitoring Software
Stealth Means Simulating Human Patterns
Be Kind to Your Resources
Run Your Webbot During Busy Hours
Don't Run Your Webbot at the Same Time Each Day
Don't Run Your Webbot on Holidays and Weekends
Use Random, Intra-fetch Delays
Final Thoughts
25. WRITING FAULT-TOLERANT WEBBOTS
Types of Webbot Fault Tolerance
Adapting to Changes in URLs
Avoid Making Requests for Pages That Don't Exist
Follow Page Redirections
Maintain the Accuracy of Referer Values
Adapting to Changes in Page Content
Avoid Position Parsing
Use Relative Parsing
Look for Landmarks That Are Least Likely to Change
Adapting to Changes in Forms
Adapting to Changes in Cookie Management
Adapting to Network Outages and Network Congestion
Error Handlers
26. DESIGNING WEBBOT-FRIENDLY WEBSITES
Optimizing Web Pages for Search Engine Spiders
Well-Defined Links
Google Bombs and Spam Indexing
Title Tags
Meta Tags
Header Tags
Image alt Attributes
Web Design Techniques That Hinder Search Engine Spiders
JavaScript
Non-ASCII Content
Designing Data-Only Interfaces
XML
Lightweight Data Exchange
How Not to Design a Lightweight Interface
A Safer Method of Passing Variables to Webbots
SOAP
Advantages of SOAP
Disadvantages of SOAP
27. KILLING SPIDERS
Asking Nicely
Create a Terms of Service Agreement
Use the robots.txt File
Use the Robots Meta Tag
Building Speed Bumps
Selectively Allow Access to Specific Web Agents
Use Obfuscation
Use Cookies, Encryption, JavaScript, and Redirection
Authenticate Users
Update Your Site Often
Embed Text in Other Media
Setting Traps
Create a Spider Trap
Fun Things to Do with Unwanted Spiders
Final Thoughts
28. KEEPING WEBBOTS OUT OF TROUBLE
It's All About Respect
Copyright
Do Consult Resources
Don't Be an Armchair Lawyer
Copyrights Do Not Have to Be Registered
Assume "All Rights Reserved"
You Cannot Copyright a Fact
You Can Copyright a Collection of Facts if Presented Creatively
You Can Use Some Material Under Fair Use Laws
Trespass to Chattels
Internet Law
Final Thoughts
A. PHP/CURL REFERENCE
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
CURLOPT_URL
CURLOPT_RETURNTRANSFER
CURLOPT_REFERER
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
CURLOPT_USERAGENT
CURLOPT_NOBODY and CURLOPT_HEADER
CURLOPT_TIMEOUT
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
CURLOPT_HTTPHEADER
CURLOPT_SSL_VERIFYPEER
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
CURLOPT_POST and CURLOPT_POSTFIELDS
CURLOPT_VERBOSE
CURLOPT_PORT
Executing the PHP/CURL Command
Retrieving PHP/CURL Session Information
Viewing PHP/CURL Errors
Closing PHP/CURL Sessions
B. STATUS CODES
HTTP Codes
NNTP Codes
C. SMS EMAIL ADDRESSES
About the Author
Colophon