Preface

Chapter 1: Introduction to Data Analysis
    Origins of data analysis
    The scientific method
    Actuarial science
    Calculated by steam
    A spectacular example
    Herman Hollerith
    ENIAC
    VisiCalc
    Data, information, and knowledge
    Why Java?
    Java Integrated Development Environments
    Summary

Chapter 2: Data Preprocessing
    Data types
    Variables
    Data points and datasets
    Null values
    Relational database tables
    Key fields
    Key-value pairs
    Hash tables
    File formats
    Microsoft Excel data
    XML and JSON data
    Generating test datasets
    Metadata
    Data cleaning
    Data scaling
    Data filtering
    Sorting
    Merging
    Hashing
    Summary

Chapter 3: Data Visualization
    Tables and graphs
    Scatter plots
    Line graphs
    Bar charts
    Histograms
    Time series
    Java implementation
    Moving average
    Data ranking
    Frequency distributions
    The normal distribution
    A thought experiment
    The exponential distribution
    Java example
    Summary

Chapter 4: Statistics
    Descriptive statistics
    Random sampling
    Random variables
    Probability distributions
    Cumulative distributions
    The binomial distribution
    Multivariate distributions
    Conditional probability
    The independence of probabilistic events
    Contingency tables
    Bayes' theorem
    Covariance and correlation
    The standard normal distribution
    The central limit theorem
    Confidence intervals
    Hypothesis testing
    Summary

Chapter 5: Relational Databases
    The relation data model
    Relational databases
    Foreign keys
    Relational database design
    Creating a database
    SQL commands
    Inserting data into the database
    Database queries
    SQL data types
    JDBC
    Using a JDBC PreparedStatement
    Batch processing
    Database views
    Subqueries
    Table indexes
    Summary

Chapter 6: Regression Analysis
    Linear regression
    Linear regression in Excel
    Computing the regression coefficients
    Variation statistics
    Java implementation of linear regression
    Anscombe's quartet
    Polynomial regression
    Multiple linear regression
    The Apache Commons implementation
    Curve fitting
    Summary

Chapter 7: Classification Analysis
    Decision trees
    What does entropy have to do with it?
    The ID3 algorithm
    Java implementation of the ID3 algorithm
    The Weka platform
    The ARFF filetype for data
    Java implementation with Weka
    Bayesian classifiers
    Java implementation with Weka
    Support vector machine algorithms
    Logistic regression
    K-Nearest Neighbors
    Fuzzy classification algorithms
    Summary

Chapter 8: Cluster Analysis
    Measuring distances
    The curse of dimensionality
    Hierarchical clustering
    Weka implementation
    K-means clustering
    K-medoids clustering
    Affinity propagation clustering
    Summary

Chapter 9: Recommender Systems
    Utility matrices
    Similarity measures
    Cosine similarity
    A simple recommender system
    Amazon's item-to-item collaborative filtering recommender
    Implementing user ratings
    Large sparse matrices
    Using random access files
    The Netflix prize
    Summary

Chapter 10: NoSQL Databases
    The Map data structure
    SQL versus NoSQL
    The Mongo database system
    The Library database
    Java development with MongoDB
    The MongoDB extension for geospatial databases
    Indexing in MongoDB
    Why NoSQL and why MongoDB?
    Other NoSQL database systems
    Summary

Chapter 11: Data Analysis with Java
    Scaling, data striping, and sharding
    Google's PageRank algorithm
    Google's MapReduce framework
    Some examples of MapReduce applications
    The WordCount example
    Scalability
    Matrix multiplication with MapReduce
    MapReduce in MongoDB
    Apache Hadoop
    Hadoop MapReduce
    Summary

Appendix: Java Tools
    The command line
    Java
    NetBeans
    MySQL
    MySQL Workbench
    Accessing the MySQL database from NetBeans
    The Apache Commons Math Library
    The javax JSON Library
    The Weka libraries
    MongoDB

Index