Advanced Analytics with Spark (Reprint Edition)

List price: ¥56.00

Author:	Ryza et al. (USA)
Publisher:	Southeast University Press
Series:
Tags:	Computers/Networking, Data Warehousing and Data Mining, Databases


ISBN:	9787564159108	Publication date:	2015-09-01	Binding:	Paperback
Format:	16mo	Pages:	260	Word count:

Description

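In the practical book Advanced Analytics with Spark (Reprint Edition, in English) by Ryza et al., four data scientists from Cloudera present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring together Spark, statistical methods, and real-world data sets to teach you how to approach analytics problems by example. You start with an introduction to Spark and its ecosystem, then dive into patterns that apply standard techniques, including classification, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics and you program in Java, Python, or Scala, you will find these patterns useful for your own data analysis applications.

Patterns include:
  Recommending music and the Audioscrobbler data set
  Predicting forest cover with decision trees
  Anomaly detection in network traffic with K-means clustering
  Understanding Wikipedia with latent semantic analysis
  Analyzing co-occurrence networks with GraphX
  Geospatial and temporal data analysis on the New York City taxi trip data
  Estimating financial risk through Monte Carlo simulation
  Analyzing genomics data and the BDG project
  Analyzing neuroimaging data with PySpark and Thunder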

About the Authors

No author biography is currently available for Advanced Analytics with Spark (Reprint Edition).

Table of Contents

Foreword
Preface

1. Analyzing Big Data
  The Challenges of Data Science
  Introducing Apache Spark
  About This Book

2. Introduction to Data Analysis with Scala and Spark
  Scala for Data Scientists
  The Spark Programming Model
  Record Linkage
  Getting Started: The Spark Shell and SparkContext
  Bringing Data from the Cluster to the Client
  Shipping Code from the Client to the Cluster
  Structuring Data with Tuples and Case Classes
  Aggregations
  Creating Histograms
  Summary Statistics for Continuous Variables
  Creating Reusable Code for Computing Summary Statistics
  Simple Variable Selection and Scoring
  Where to Go from Here

3. Recommending Music and the Audioscrobbler Data Set
  Data Set
  The Alternating Least Squares Recommender Algorithm
  Preparing the Data
  Building a First Model
  Spot Checking Recommendations
  Evaluating Recommendation Quality
  Computing AUC
  Hyperparameter Selection
  Making Recommendations
  Where to Go from Here

4. Predicting Forest Cover with Decision Trees
  Fast Forward to Regression
  Vectors and Features
  Training Examples
  Decision Trees and Forests
  Covtype Data Set
  Preparing the Data
  A First Decision Tree
  Decision Tree Hyperparameters
  Tuning Decision Trees
  Categorical Features Revisited
  Random Decision Forests
  Making Predictions
  Where to Go from Here

5. Anomaly Detection in Network Traffic with K-means Clustering
  Anomaly Detection
  K-means Clustering
  Network Intrusion
  KDD Cup 1999 Data Set
  A First Take on Clustering
  Choosing k
  Visualization in R
  Feature Normalization
  Categorical Variables
  Using Labels with Entropy
  Clustering in Action
  Where to Go from Here

6. Understanding Wikipedia with Latent Semantic Analysis
  The Term-Document Matrix
  Getting the Data
  Parsing and Preparing the Data
  Lemmatization
  Computing the TF-IDFs
  Singular Value Decomposition
  Finding Important Concepts
  Querying and Scoring with the Low-Dimensional Representation
  Term-Term Relevance
  Document-Document Relevance
  Term-Document Relevance
  Multiple-Term Queries
  Where to Go from Here

7. Analyzing Co-occurrence Networks with GraphX
  The MEDLINE Citation Index: A Network Analysis
  Getting the Data
  Parsing XML Documents with Scala's XML Library
  Analyzing the MeSH Major Topics and Their Co-occurrences
  Constructing a Co-occurrence Network with GraphX
  Understanding the Structure of Networks
  Connected Components
  Degree Distribution
  Filtering Out Noisy Edges
  Processing EdgeTriplets
  Analyzing the Filtered Graph
  Small-World Networks
  Cliques and Clustering Coefficients
  Computing Average Path Length with Pregel
  Where to Go from Here

8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  Getting the Data
  Working with Temporal and Geospatial Data in Spark
  Temporal Data with JodaTime and NScalaTime
  Geospatial Data with the Esri Geometry API and Spray
  Exploring the Esri Geometry API
  Intro to GeoJSON
  Preparing the New York City Taxi Trip Data
  Handling Invalid Records at Scale
  Geospatial Analysis
  Sessionization in Spark
  Building Sessions: Secondary Sorts in Spark
  Where to Go from Here

9. Estimating Financial Risk through Monte Carlo Simulation
  Terminology
  Methods for Calculating VaR
  Variance-Covariance
  Historical Simulation
  Monte Carlo Simulation
  Our Model
  Getting the Data
  Preprocessing
  Determining the Factor Weights
  Sampling
  The Multivariate Normal Distribution
  Running the Trials
  Visualizing the Distribution of Returns
  Evaluating Our Results
  Where to Go from Here

10. Analyzing Genomics Data and the BDG Project
  Decoupling Storage from Modeling
  Ingesting Genomics Data with the ADAM CLI
  Parquet Format and Columnar Storage
  Predicting Transcription Factor Binding Sites from ENCODE Data
  Querying Genotypes from the 1000 Genomes Project
  Where to Go from Here

11. Analyzing Neuroimaging Data with PySpark and Thunder
  Overview of PySpark
  PySpark Internals
  Overview and Installation of the Thunder Library
  Loading Data with Thunder
  Thunder Core Data Types
  Categorizing Neuron Types with Thunder
  Where to Go from Here

A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index