Apache Spark is a newly developed distributed framework, optimized in particular for low-latency tasks and in-memory data storage. Combining speed, scalability, in-memory processing, and fault tolerance, it is one of the few frameworks well suited to parallel computing, while also being remarkably easy to program, with a flexible, expressive, and powerful API design. Machine Learning with Spark (English reprint edition) guides you through the fundamentals of the Spark API for loading and processing data, and shows how to prepare suitable input data for a variety of machine learning models. Detailed examples and real-world use cases help you learn common machine learning models, including recommendation systems, classification, regression, clustering, and dimensionality reduction. You will also explore advanced topics such as large-scale text processing, approaches to online machine learning, and model evaluation with Spark Streaming.
About the Author
Author information for Machine Learning with Spark (English reprint edition) is not yet available.
Table of Contents
Preface
Chapter 1: Getting Up and Running with Spark
  Installing and setting up Spark locally
  Spark clusters
  The Spark programming model
  SparkContext and SparkConf
  The Spark shell
  Resilient Distributed Datasets
  Creating RDDs
  Spark operations
  Caching RDDs
  Broadcast variables and accumulators
  The first step to a Spark program in Scala
  The first step to a Spark program in Java
  The first step to a Spark program in Python
  Getting Spark running on Amazon EC2
  Launching an EC2 Spark cluster
  Summary
Chapter 2: Designing a Machine Learning System
  Introducing MovieStream
  Business use cases for a machine learning system
  Personalization
  Targeted marketing and customer segmentation
  Predictive modeling and analytics
  Types of machine learning models
  The components of a data-driven machine learning system
  Data ingestion and storage
  Data cleansing and transformation
  Model training and testing loop
  Model deployment and integration
  Model monitoring and feedback
  Batch versus real time
  An architecture for a machine learning system
  Practical exercise
  Summary
Chapter 3: Obtaining, Processing, and Preparing Data with Spark
  Accessing publicly available datasets
  The MovieLens 100k dataset
  Exploring and visualizing your data
  Exploring the user dataset
  Exploring the movie dataset
  Exploring the rating dataset
  Processing and transforming your data
  Filling in bad or missing data
  Extracting useful features from your data
  Numerical features
  Categorical features
  Derived features
  Transforming timestamps into categorical features
  Text features
  Simple text feature extraction
  Normalizing features
  Using MLlib for feature normalization
  Using packages for feature extraction
  Summary
Chapter 4: Building a Recommendation Engine with Spark
  Types of recommendation models
  Content-based filtering
  Collaborative filtering
  Matrix factorization
  Extracting the right features from your data
  Extracting features from the MovieLens 100k dataset
  Training the recommendation model
  Training a model on the MovieLens 100k dataset
  Training a model using implicit feedback data
  Using the recommendation model
  User recommendations
  Generating movie recommendations from the MovieLens 100k dataset
  Item recommendations
  Generating similar movies for the MovieLens 100k dataset
  Evaluating the performance of recommendation models
  Mean Squared Error
  Mean average precision at K
  Using MLlib's built-in evaluation functions
  RMSE and MSE
  MAP
  Summary
Chapter 5: Building a Classification Model with Spark
  Types of classification models
  Linear models
  Logistic regression
  Linear support vector machines
  The naïve Bayes model
  Decision trees
  Extracting the right features from your data
  Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
  Training classification models
  Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
  Using classification models
  Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
  Evaluating the performance of classification models
  Accuracy and prediction error
  Precision and recall
  ROC curve and AUC
  Improving model performance and tuning parameters
  Feature standardization
  Additional features
  Using the correct form of data
  Tuning model parameters
  Linear models
  Decision trees
  The naïve Bayes model
  Cross-validation
  Summary
Chapter 6: Building a Regression Model with Spark
  Types of regression models
  Least squares regression
  Decision trees for regression
  Extracting the right features from your data
  Extracting features from the bike sharing dataset
  Creating feature vectors for the linear model
  Creating feature vectors for the decision tree
  Training and using regression models
  Training a regression model on the bike sharing dataset
  Evaluating the performance of regression models
  Mean Squared Error and Root Mean Squared Error
  Mean Absolute Error
  Root Mean Squared Log Error
  The R-squared coefficient
  Computing performance metrics on the bike sharing dataset
  Linear model
  Decision tree
  Improving model performance and tuning parameters
  Transforming the target variable
  Impact of training on log-transformed targets
  Tuning model parameters
  Creating training and testing sets to evaluate parameters
  The impact of parameter settings for linear models
  The impact of parameter settings for the decision tree
  Summary
Chapter 7: Building a Clustering Model with Spark
  Types of clustering models
  K-means clustering
  Initialization methods
  Variants
  Mixture models
  Hierarchical clustering
  Extracting the right features from your data
  Extracting features from the MovieLens dataset
  Extracting movie genre labels
  Training the recommendation model
  Normalization
  Training a clustering model
  Training a clustering model on the MovieLens dataset
  Making predictions using a clustering model
  Interpreting cluster predictions on the MovieLens dataset
  Interpreting the movie clusters
  Evaluating the performance of clustering models
  Internal evaluation metrics
  External evaluation metrics
  Computing performance metrics on the MovieLens dataset
  Tuning parameters for clustering models
  Selecting K through cross-validation
  Summary
Chapter 8: Dimensionality Reduction with Spark
  Types of dimensionality reduction
  Principal Components Analysis
  Singular Value Decomposition
  Relationship with matrix factorization
  Clustering as dimensionality reduction
  Extracting the right features from your data
  Extracting features from the LFW dataset
  Exploring the face data
  Visualizing the face data
  Extracting facial images as vectors
  Normalization
  Training a dimensionality reduction model
  Running PCA on the LFW dataset
  Visualizing the Eigenfaces
  Interpreting the Eigenfaces
  Using a dimensionality reduction model
  Projecting data using PCA on the LFW dataset
  The relationship between PCA and SVD
  Evaluating dimensionality reduction models
  Evaluating k for SVD on the LFW dataset
  Summary
Chapter 9: Advanced Text Processing with Spark
  What's so special about text data?
  Extracting the right features from your data
  Term weighting schemes
  Feature hashing
  Extracting the TF-IDF features from the 20 Newsgroups dataset
  Exploring the 20 Newsgroups data
  Applying basic tokenization
  Improving our tokenization
  Removing stop words
  Excluding terms based on frequency
  A note about stemming
  Training a TF-IDF model
  Analyzing the TF-IDF weightings
  Using a TF-IDF model
  Document similarity with the 20 Newsgroups dataset and TF-IDF features
  Training a text classifier on the 20 Newsgroups dataset using TF-IDF
  Evaluating the impact of text processing
  Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset
  Word2Vec models
  Word2Vec on the 20 Newsgroups dataset
  Summary
Chapter 10: Real-time Machine Learning with Spark Streaming
  Online learning
  Stream processing
  An introduction to Spark Streaming
  Input sources
  Transformations
  Actions
  Window operators
  Caching and fault tolerance with Spark Streaming
  Creating a Spark Streaming application
  The producer application
  Creating a basic streaming application
  Streaming analytics
  Stateful streaming
  Online learning with Spark Streaming
  Streaming regression
  A simple streaming regression program
  Creating a streaming data producer
  Creating a streaming regression model
  Streaming K-means
  Online model evaluation
  Comparing model performance with Spark Streaming
  Summary
Index