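Soumen Chakrabarti is a well-known expert in Web search and mining and an associate editor of ACM Transactions on the Web. He holds a PhD from the University of California, Berkeley, and is currently an associate professor in the Department of Computer Science and Engineering at the Indian Institute of Technology. He previously worked at the IBM Almaden Research Center on hypertext databases and data mining. He has extensive hands-on project experience, has built several Web mining systems, and holds multiple US patents.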
Table of Contents
1 INTRODUCTION
  1.1 Crawling and Indexing
  1.2 Topic Directories
  1.3 Clustering and Classification
  1.4 Hyperlink Analysis
  1.5 Resource Discovery and Vertical Portals
  1.6 Structured vs. Unstructured Data Mining
  1.7 Bibliographic Notes
PART Ⅰ INFRASTRUCTURE
2 CRAWLING THE WEB
  2.1 HTML and HTTP Basics
  2.2 Crawling Basics
  2.3 Engineering Large-Scale Crawlers
    2.3.1 DNS Caching, Prefetching, and Resolution
    2.3.2 Multiple Concurrent Fetches
    2.3.3 Link Extraction and Normalization
    2.3.4 Robot Exclusion
    2.3.5 Eliminating Already-Visited URLs
    2.3.6 Spider Traps
    2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages
    2.3.8 Load Monitor and Manager
    2.3.9 Per-Server Work-Queues
    2.3.10 Text Repository
    2.3.11 Refreshing Crawled Pages
  2.4 Putting Together a Crawler
    2.4.1 Design of the Core Components
    2.4.2 Case Study: Using w3c-libwww
  2.5 Bibliographic Notes
3 WEB SEARCH AND INFORMATION RETRIEVAL
  3.1 Boolean Queries and the Inverted Index
    3.1.1 Stopwords and Stemming
    3.1.2 Batch Indexing and Updates
    3.1.3 Index Compression Techniques
  3.2 Relevance Ranking
    3.2.1 Recall and Precision
    3.2.2 The Vector-Space Model
    3.2.3 Relevance Feedback and Rocchio's Method
    3.2.4 Probabilistic Relevance Feedback Models
    3.2.5 Advanced Issues
  3.3 Similarity Search
    3.3.1 Handling "Find-Similar" Queries
    3.3.2 Eliminating Near Duplicates via Shingling
    3.3.3 Detecting Locally Similar Subgraphs of the Web
  3.4 Bibliographic Notes
PART Ⅱ LEARNING
4 SIMILARITY AND CLUSTERING
  4.1 Formulations and Approaches
    4.1.1 Partitioning Approaches
    4.1.2 Geometric Embedding Approaches
    4.1.3 Generative Models and Probabilistic Approaches
  4.2 Bottom-Up and Top-Down Partitioning Paradigms
    4.2.1 Agglomerative Clustering
    4.2.2 The k-Means Algorithm
  4.3 Clustering and Visualization via Embeddings
    4.3.1 Self-Organizing Maps (SOMs)
    4.3.2 Multidimensional Scaling (MDS) and FastMap
    4.3.3 Projections and Subspaces
    4.3.4 Latent Semantic Indexing (LSI)
  4.4 Probabilistic Approaches to Clustering
    4.4.1 Generative Distributions for Documents
    4.4.2 Mixture Models and Expectation Maximization (EM)
    4.4.3 Multiple Cause Mixture Model (MCMM)
    4.4.4 Aspect Models and Probabilistic LSI
    4.4.5 Model and Feature Selection
  4.5 Collaborative Filtering
    4.5.1 Probabilistic Models
    4.5.2 Combining Content-Based and Collaborative Features
  4.6 Bibliographic Notes
5 SUPERVISED LEARNING
  5.1 The Supervised Learning Scenario
  5.2 Overview of Classification Strategies
  5.3 Evaluating Text Classifiers
    5.3.1 Benchmarks
    5.3.2 Measures of Accuracy
  5.4 Nearest Neighbor Learners
    5.4.1 Pros and Cons
    5.4.2 Is TFIDF Appropriate?
  5.5 Feature Selection
    5.5.1 Greedy Inclusion Algorithms
    5.5.2 Truncation Algorithms
    5.5.3 Comparison and Discussion
  5.6 Bayesian Learners
    5.6.1 Naive Bayes Learners
    5.6.2 Small-Degree Bayesian Networks
  5.7 Exploiting Hierarchy among Topics
    5.7.1 Feature Selection
    5.7.2 Enhanced Parameter Estimation
    5.7.3 Training and Search Strategies
  5.8 Maximum Entropy Learners
  5.9 Discriminative Classification
    5.9.1 Linear Least-Square Regression
    5.9.2 Support Vector Machines
  5.10 Hypertext Classification
    5.10.1 Representing Hypertext for Supervised Learning
    5.10.2 Rule Induction
  5.11 Bibliographic Notes
6 SEMISUPERVISED LEARNING
  6.1 Expectation Maximization
    6.1.1 Experimental Results
    6.1.2 Reducing the Belief in Unlabeled Documents
    6.1.3 Modeling Labels Using Many Mixture Components
……
PART Ⅲ APPLICATIONS