Monday, November 11, 2013


Data Mining
Data mining, also known as knowledge discovery in data, is the analysis of large volumes of data from various perspectives in order to identify relations within the data and summarize them into useful information that can help reduce costs, increase revenue or serve other business purposes. It is the process of finding relations and patterns in huge relational databases. These patterns include groups of similar records (cluster analysis), unusual records (anomaly detection) and dependencies between items (association rule mining). Data mining uses past data to predict the outcome of a particular situation or to arrive at a solution to a problem, and the data it analyzes is typically stored in data warehouses. Data mining is commonly used in direct mail marketing, web site personalization, credit card fraud detection, bioinformatics, text analysis, market basket analysis and so on, and it relies on database techniques such as spatial indices. Patterns are treated as summaries of the input data and can feed prediction results into a decision support system. Result interpretation, data preparation, data collection and reporting are not part of the data mining step itself; they belong to the wider Knowledge Discovery in Data (KDD) process as additional steps. Data snooping, data dredging and data fishing are terms for applying data mining methods to samples of a larger data set; such explorations can help create new hypotheses to test against larger volumes of data.
Though the term data mining is relatively new, the technology behind it is not. Companies have long used powerful computers to sift through large volumes of data and analyze market research reports. However, innovations in computer processing, statistical software and disk storage have dramatically increased the accuracy of this analysis at low cost. The main objective of data mining is to extract useful information from data and put it to further use. This involves databases, data pre-processing, data management, inference and complexity considerations, interestingness metrics, post-processing of discovered structures, online updating and visualization. Extracting patterns from data manually has been done for centuries; methods such as regression analysis and Bayes' theorem were in use long before computers. Advances in computing have increased data collection, storage and manipulation, and the growing size and complexity of data sets has been matched by automated data processing using cluster analysis, neural networks, decision trees, genetic algorithms and support vector machines. The data mining process applies these methods to reveal patterns in large databases. It narrows the gap between artificial intelligence and applied statistics on one side and database management on the other, by exploiting how data is stored and indexed and by using discovery and learning algorithms that scale to ever larger data sets.
The data mining process has the following stages: selection, pre-processing, transformation, data mining and interpretation/evaluation. In simplified form, the process can be expressed as pre-processing, data mining and result validation. Choosing the right data mining technique is not an easy task: many commercial packages offer a wide range of possibilities, so the decision requires expertise. Classifying the data mining methods makes it easier to understand the various options.
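To make these stages concrete, here is a minimal sketch in Python (scikit-learn and a synthetic data set are my own illustrative assumptions, not anything from this post) that walks through selection, pre-processing, transformation, data mining and evaluation:

```python
# A minimal sketch of the selection -> pre-processing -> transformation ->
# data mining -> evaluation stages, using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Selection: choose the sample and variables (here, a synthetic table).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Pre-processing: hold out part of the data for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Transformation: scale the predictors.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Data mining: fit a pattern-finding model.
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Interpretation / evaluation: check how well the patterns generalize.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```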
The classical scheme of Knowledge Discovery from Data given by Fayyad in 1996 lists the following steps for the high-level KDD process, often simply called data mining (Spate et al. 2006):
1. Understanding the domain and the objectives of the end user.
2. Creating the data set by choosing the appropriate data samples and set of variables.
3. Data pre-processing and cleaning. The quality of the result depends strongly on the quality of the input data, so this step is crucial (Gibert et al. 2008b).
4. Data projection and reduction. Depending on the problem, it may be convenient to simplify the set of variables; the aim is to keep a relevant set of variables that describes the system adequately and efficiently (Núñez et al. 2004, Gibert et al. 2008b).
5. Choosing the data mining task with reference to the goal of the KDD process. From clustering to time series forecasting, many techniques exist for different purposes and with different requirements; see (Kdnuggets 2006) for a survey of the most common ones.
6. Selecting the data mining algorithms. Once goals and tasks are decided, a set of methods has to be selected for searching for patterns in the data, and the choice of technique determines whether parameter optimization is required.
7. Data mining, that is, searching for patterns in the data. If all of the previous steps are performed carefully, the results of this step improve drastically.
8. Interpreting the patterns from the previous step. This is crucial if the discovered patterns are to improve the expert's knowledge of the analyzed phenomenon or to support further decision-making (Pérez-Bonilla et al. 2007, Gibert et al. 2010, Gibert et al. 2008). If the results look inconsistent, further iterations of the previous steps may be required to refine the analysis.
9. Consolidating the interpretations: reporting and documenting the results, and putting them to use in the system they were produced for.
Predictive data mining is the most common type of data mining and is used in many business applications. There are three crucial stages in the predictive data mining process: initial exploration, model building (identifying patterns with verification/validation) and deployment.
Exploration: Exploration includes selecting the data sets (or, for large data sets, a subset of fields), transforming and cleaning the data, and performing preliminary feature selection to bring the number of fields down to a range manageable by the statistical methods to be used. This first phase may involve anything from choosing simple predictors for a regression model to elaborate exploratory analyses using statistical and graphical methods to identify the relevant fields; it also includes deciding on the complexity of the models to be used in the next stage.
Model building and validation: This stage considers various models and selects the best one based on predictive performance. A wide range of techniques exists for this, several of which are based on "competitive evaluation of models": applying different models to the same data set, comparing the results and choosing the best one (a small sketch of this follows below). Techniques considered vital in predictive data mining include boosting, bagging (voting, averaging), meta-learning and stacking (stacked generalization).
Deployment: Deployment is the last stage, where the model selected in the preceding stage is applied to new data to make predictions or estimates of the expected outcome.
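A minimal sketch of the competitive-evaluation idea, using scikit-learn on synthetic data (the data, the candidate models and the 5-fold split are my own illustrative choices): several candidates are scored on the same folds, the winner is kept and then applied to stand-in "fresh" records.

```python
# Sketch of "competitive evaluation of models": several candidate models are
# applied to the same data set, compared by cross-validation, and the best one
# is kept for deployment. Data and model choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=7),
}

# Model building and validation: score each candidate on the same folds.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)

# Deployment: fit the winner on all data and apply it to new records.
best_model = candidates[best_name].fit(X, y)
predictions = best_model.predict(X[:5])   # stand-in for fresh, unseen records
```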
Data mining has gained importance as an information tool for business management, where it reveals knowledge structures that guide decisions under conditions of limited certainty. New analytic techniques, such as classification trees, have been developed specifically for business data mining. However, data mining mostly builds on established statistical principles such as exploratory data analysis (EDA) and modeling, and it shares many concepts with those techniques. Data mining also takes a black-box approach to knowledge discovery and data exploration: it uses not only traditional EDA techniques but also techniques such as neural networks that can generate valid predictions without identifying the nature of the interrelations between the fields on which the predictions are based. Data mining is often described as "a blend of statistics, AI and database research" (Pregibon, 1997), and because of this applied appeal it has gained importance and is growing rapidly within statistics.
Important concepts of data mining
Bagging (voting, averaging): Bagging is a predictive data mining technique. It uses voting for classification problems and averaging for regression problems with continuous dependent variables of interest. It combines the predictions from several models, or from the same model fit to different subsets of the input data. Bagging is especially useful when complex models applied to small data sets produce unstable results: on such small data sets we can repeatedly draw sub-samples and fit a tree classifier to each successive sample. A single prediction is then derived by letting all the trees from the different samples vote; the final classification is the one predicted most often by the various trees. Weighted combinations of predictions are also used, and a high-level procedure for generating the weights in a weighted prediction or vote is the boosting procedure.
Boosting: Boosting is another predictive data mining concept for generating multiple classifiers or models and deriving weights to combine their predictions into a single prediction. A boosting algorithm looks roughly like this: apply a method to the data set with every observation given equal weight and compute the classifications; then assign to each observation a weight inversely proportional to the accuracy of its classification, so that observations that are difficult to classify get larger weights and observations that are easy to classify get smaller ones; apply the classifier again to this weighted data, and continue with the next iteration. Boosting thus generates classifiers in sequence, where every consecutive classifier becomes better at classifying the records that were not classified properly by the classifiers preceding it. These classifiers can be combined during the deployment phase to arrive at a single best classification or prediction. The technique can also be applied to methods that do not directly support misclassification costs or weights; in that case random sub-sampling is used in the successive steps of the boosting loop, with the probability of selecting an observation into the sub-sample inversely proportional to the accuracy of its prediction in the previous iteration. A short sketch of bagging and boosting appears a little further below.
Data mining models: All data mining models address how to convert data into information, how to apply data mining methodologies to a data set, and how to report information in a form that stakeholders can easily use to make strategic decisions.
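Returning to bagging and boosting, here is a minimal sketch using scikit-learn's ready-made implementations on a synthetic data set (the data, parameters and choice of base tree are my own illustrative assumptions):

```python
# Sketch of bagging (voting over trees fit on bootstrap sub-samples) and
# boosting (re-weighting hard-to-classify records between iterations),
# using scikit-learn's implementations on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=2)

# Bagging: the same tree classifier is fit to repeated sub-samples and the
# final classification is the one most often predicted (voting).
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=25, random_state=2)

# Boosting: each successive classifier concentrates on the records the
# previous ones misclassified; the weighted ensemble gives the prediction.
boosted = AdaBoostClassifier(n_estimators=50, random_state=2)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```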
Software tools are built specifically to fit these designs. Models range from easy to understand to nearly incomprehensible, with decision trees at the easy end, neural networks at the hard end, and rule induction and regression models lying in between. In the data mining context, various "frameworks" have been proposed as blueprints for gathering data, analyzing data, generating results, implementing results and looking for improvements. One such model is CRISP (Cross-Industry Standard Process for Data Mining), proposed by a European consortium of companies, which lays out the general sequence of steps in a data mining project. Another approach is the Six Sigma methodology, a data-driven, well-structured process for eliminating waste, defects and quality-control problems of all kinds in service delivery, manufacturing and other business activities. It is popular across American industries because of its successful implementations; its sequence of steps, known as DMAIC, grew out of the quality improvement, manufacturing and process control traditions and suits production environments well. A similar framework used by the SAS Institute is SEMMA, which focuses more on the technical activities of a data mining project.
Predictive data mining: This is a "black box" approach that predicts the future from past and present information, using large amounts of input data. It is used in projects whose objective is to identify a neural network, statistical model or other model that predicts a pattern in a particular data set. For example, a credit card company might use predictive data mining to build a model that identifies transactions with a high chance of being fraudulent. Other projects, such as identifying customer segments or clusters, rely instead on exploratory and drill-down methods. Data reduction is another goal of data mining.
Stacking (stacked generalization): Stacking is applied in predictive data mining to combine the estimates or predictions of several models, and it is especially useful when very different models are included in a project, for example a mix of classifiers such as CHAID, C&RT, neural networks and linear discriminant analysis. Combining estimates from different models usually yields more accurate predictions than any single method on its own. In stacking, the predictions from the various models are used as inputs to a meta-learner that combines them into a final, best prediction. Other methods that combine predictions from several models are bagging and boosting.
Text mining: Where data mining deals with detecting patterns in numeric data, we often also need to detect patterns in data stored as text. Text data is amorphous and not easy to work with. Text mining analyzes text documents by extracting concepts, key phrases and so on, and processes this input into a form that can be analyzed further with numeric data mining methods; a small sketch of this step follows below.
Kinds of data mining problems: classification/segmentation, forecasting, association rule extraction, sequence detection and clustering.
Classical techniques: statistics, neighborhoods and clustering. The techniques in this group have been in use for decades and are applied mostly to existing, well-understood business problems.
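Here is that text-to-numbers step as a minimal sketch, using scikit-learn's TfidfVectorizer on a handful of invented documents (everything in the example is an illustrative assumption, not part of the original post); once the text is a numeric matrix, the earlier techniques such as clustering apply directly.

```python
# Sketch of the text mining step: free-text documents are turned into numeric
# feature vectors (here TF-IDF weights) so that the usual numeric data mining
# methods can be applied to them. The documents are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "customer unhappy with late delivery",
    "delivery arrived late and damaged",
    "great product, fast delivery",
    "product quality is great",
]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(docs)        # sparse document-term matrix

# Once text is numeric, any of the earlier techniques apply, e.g. clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(dict(zip(docs, labels)))
```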
Statistics: Statistical techniques were in use long before anyone started applying them to business problems. They are data driven and are used to find patterns or to help build other predictive models. When tackling a "data mining" problem, the user has to decide whether to start with statistical techniques or with other techniques.
Statistics for prediction: Regression is one of the most powerful and commonly used techniques in statistics, and in many contexts prediction is simply called regression.
Linear regression: Regression and prediction are used almost interchangeably in statistics. The basic idea behind regression is to build a model that maps predictor values to predicted values in a way that minimizes the error of the predictions. A simple linear regression has one predictor and one prediction. The two variables are plotted on a two-dimensional graph, with the prediction values along the Y axis and the predictor values along the X axis, and the model can then be viewed as the line through the data points that minimizes the distance between the points and the line, i.e. the error between the actual values and the model's predictions. Things become more complicated as we introduce more fields to better model a particular problem: more predictors can be used, transformations can be applied to the predictors, predictors can be multiplied together and used in the equation, and modifications can accommodate responses that take only yes/no values. Adding predictors to simple linear regression produces more complicated fits that take more information into account; this is multiple linear regression, and it generally makes better predictions. A small sketch of regression and nearest neighbor prediction follows below.
Nearest neighbor: Nearest neighbor and clustering are among the oldest prediction techniques in data mining. Clustering groups similar records (or fields) together. Nearest neighbor is similar to clustering in that, to predict a value for a record, it looks at historical records with similar predictor values and uses the prediction value from the record nearest to the unclassified one. "Nearness" in a database is usually defined over many factors rather than a single one. The technique is easy to understand and easy to use. Nearest neighbor is used in product recommendation, where, when a user selects a product, we can show other products similar to the one selected; it is likewise used in predicting stock market data. One improvement to the nearest neighbor algorithm is to take a vote among the "k" nearest neighbors rather than considering only the single record nearest to the unclassified one. If the neighbors are close enough to the unclassified record, we can place higher confidence in the prediction. The degree of homogeneity among the predictions of the k nearest neighbors can be used as well: if all the nearest neighbors make the same prediction, we can be more confident than if half the records make one prediction and the other half another.
Clustering: Clustering groups similar records together into clusters, giving a high-level view of the database. Two commercial clustering systems are MicroVision from Equifax and the PRIZM system from Claritas; they group the population into clusters based on demographic information, which can be useful in sales and direct marketing.
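Here is a minimal sketch of both ideas on invented data (the sizes, noise level and toy class rule are my own assumptions): a one-predictor linear regression that recovers the line through noisy points, and a k-nearest-neighbor classifier that predicts by a vote of the five closest records.

```python
# Sketch of the two classical techniques described above: a linear regression
# that fits the line minimizing prediction error, and a k-nearest-neighbor
# classifier that takes a vote among the k closest historical records.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Linear regression: one predictor, one prediction; fit a line through (x, y).
x = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 1, size=200)      # noisy linear relationship
line = LinearRegression().fit(x, y)
print("slope:", line.coef_[0], "intercept:", line.intercept_)

# Nearest neighbor: classify a new record by a vote of its 5 nearest neighbors.
X = rng.normal(size=(300, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)         # toy class rule
knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
print("predicted class:", knn.predict([[0.2, 0.4]])[0])
```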
Clustering also helps in picking out records that stand apart from the rest of the group. Nearest neighbor can be seen as a refinement of clustering, and both use distance in a feature space to make predictions. In clustering, each predictor is usually given equal importance, and clusters are built so that the records in a cluster share similar values for the predictors the clusters were based on. Building a homogeneous cluster in which all predictor values are the same is difficult when the predictors take many values or when there are many predictors. Another constraint on clustering is that a reasonable number of clusters should be formed; what is reasonable depends on the user, and taken to the extreme it can lead to clusters containing a single record each, which defeats generalization. Many algorithms let the user select the number of clusters, or provide a "knob" with which fewer or more clusters can be created. The n-dimensional space referred to in nearest neighbor and clustering simply means a space in which distances between records can be calculated.
Non-hierarchical and hierarchical clustering: There are two broad types of clustering, one that creates a hierarchy of clusters and one that does not. Hierarchical clustering builds a hierarchy of clusters that can be large or small, with the hierarchy formed to suit the particular application. We can decide how many clusters we want from a hierarchical clustering; in the extreme we can make as many clusters as there are records in the data set, but that does not help us understand the data any better. A hierarchical clustering is a tree in which small clusters merge into clusters at the next level, and so on up to the highest-level cluster; this helps the user determine what they consider to be the correct number of clusters to summarize the data and provide useful information. There are two kinds of hierarchical clustering algorithms. Agglomerative algorithms start with every record in its own cluster, so there are as many clusters as records; the nearest clusters are merged to form the next larger cluster, and this continues until all records end up in a single topmost cluster. Divisive algorithms are the opposite: we start with all records in one cluster and repeatedly split it into smaller and smaller clusters.
Non-hierarchical clustering: Non-hierarchical techniques can build clusters from a historical database quickly, but they require decisions about the number of clusters up front. They typically run multiple times, starting from a random clustering and improving it by moving records around. There are two families of non-hierarchical clustering techniques; both are very fast to compute on large data sets but have some limitations. The first is the single-pass family, in which the database is passed through only once (each record is read from the database only once) to form the clusters. The second is the reallocation family, in which clusters are formed by reallocating or moving records from one cluster to another to improve the clustering; these methods need multiple passes through the database but are still considered faster than hierarchical techniques. Asking the user to decide on the number of clusters is not ideal, since the user may not know how many distinct clusters the data actually contains.
Hierarchical clustering: Here the clusters are defined only by the data, and the number of clusters can be decreased or increased by moving up or down the hierarchy; clusters are subdivided or merged two at a time. The sketch below contrasts the two approaches.
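A minimal sketch of the contrast on synthetic blob data (the data and the choice of three clusters are illustrative assumptions): agglomerative clustering builds the hierarchy bottom-up and is cut at three clusters, while k-means is told the number of clusters up front and reallocates records between them.

```python
# Sketch contrasting hierarchical (agglomerative) clustering, which merges the
# two nearest clusters at a time, with non-hierarchical k-means, which starts
# from a chosen number of clusters and reallocates records between them.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=4)

# Agglomerative: the hierarchy is cut at the requested number of clusters.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# K-means: the user supplies the number of clusters; records are moved between
# clusters over several passes until the assignment stabilizes.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X)

print("records per hierarchical cluster:", np.bincount(hier_labels))
print("records per k-means cluster:     ", np.bincount(kmeans_labels))
```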
Next generation techniques: trees, networks and rules
Decision trees: A decision tree is a predictive model in which each branch of the tree corresponds to a classification question and the leaves are partitions of the data set that carry a classification. A decision tree divides the data at each branch point without losing any of it: the total number of records at a parent node equals the sum of the records in its child nodes, so the number of churners and non-churners, for example, is preserved as we move up or down the tree. The model is easy to understand. Decision trees can be seen as creating a segmentation of the data (of products, customers, sales regions and so on) for the purpose of prediction: the records that fall into each segment are similar with respect to the quantity being predicted, and each predictive segment comes with a description of the characteristics that define it. The algorithms used to build decision trees can be complex, but because the resulting trees are easy to understand they are favored for building understandable models. They allow complex ROI and profit calculations to be added on top of the model, they are easy to translate into SQL for deployment against databases, they have proved easy to integrate into existing IT processes, and they need little pre-processing or cleansing of the data specifically for data mining. Decision trees offer many of the important features of data mining and can be used in a variety of business problems for both prediction and exploration; they have been applied to problems from credit card attrition prediction to time series prediction. They are not appropriate for simple problems where the prediction is just a multiple of a predictor, which can be solved by linear regression.
Growing the tree is the first step of building a decision tree model: the algorithm builds a tree that works well on all the available data by finding the best possible question to ask at each point of the tree, until the nodes at the bottom of the tree are mostly of one class or the other. What makes a question good or bad is how well it organizes, i.e. separates, the data. The overall process is similar across decision tree algorithms: they look at candidate questions that break the data set into more homogeneous segments with respect to the classes to be predicted. Some algorithms use heuristics to pick questions; CART tries all the candidate questions, picks the best one to split the data into segments, and then repeats the process separately within each segment. Decision trees stop growing when a segment contains only one record, when all the records in a segment have the same characteristics, or when further splitting does not lead to any substantial improvement. A short sketch of growing a small tree follows below.
Decision tree algorithms
ID3 and its enhancement C4.5: Introduced by J. Ross Quinlan, ID3 is one of the oldest decision tree algorithms, built on a strong foundation of inference systems and concept learning systems. Predictors and their splitting values are chosen according to the information gain provided by the splits, where gain measures the amount of information required to make a correct prediction before and after the split. C4.5 is a later version of ID3 that improves on it in several areas: predictors with missing or continuous values can still be used, pruning is introduced, and rules can be derived from the tree.
CART: Classification and Regression Trees is a data prediction and exploration algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone.
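Here is a minimal sketch of growing a small tree on scikit-learn's bundled iris data (an illustrative choice of data set): criterion="entropy" approximates the information-gain splitting of ID3/C4.5, while the default "gini" criterion is closer to CART, and export_text prints the question asked at each branch point.

```python
# Sketch of growing a small decision tree. criterion="entropy" approximates
# the information-gain splitting of ID3/C4.5; criterion="gini" would be the
# CART-style default. The printed rules show the question at each branch.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=5)
tree.fit(data.data, data.target)

# Each line is a branch-point question; the leaves carry the predicted class.
print(export_text(tree, feature_names=list(data.feature_names)))
```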
In CART, predictors are picked according to how much they decrease the disorder of the data, that is, how cleanly the resulting split separates the records. One advantage of the CART algorithm is its built-in model validation: CART grows a deliberately complex tree and then prunes it back to the optimally general tree based on the results of test-set validation or cross-validation, choosing among the different pruned versions. The algorithm is also very robust with regard to missing data: a record with missing data is not used when deciding on the best split, and at prediction time missing values are handled through surrogates, i.e. alternative predictors and split values that mimic the actual split in the tree.
CHAID: The Chi-Square Automatic Interaction Detector is similar to CART in that it builds a decision tree, but it chooses its splits differently. It relies on the chi-square test to determine which categorical predictor is furthest from independence with the prediction values. Because CHAID depends on contingency tables to form its test of significance for every predictor, each predictor must either be categorical or be coerced into a categorical form through binning.
Neural networks: Neural networks are highly accurate predictive models that have been applied to a large number of different problems. The name comes from the biological neural systems that detect patterns and make predictions; artificial neural networks are computer algorithms that implement pattern detection and machine learning to build predictive models from large historical databases. In that sense neural networks belong to the artificial intelligence community rather than to statistics, and they detect patterns in a way loosely modeled on how human beings do. They do "learn" in a real sense, but through techniques and algorithms that are not fundamentally different from other data mining or statistical methods. They can be automated to the point where the user does not need to understand how they work, or much about the database, in order to use them, and the data can often be used without many changes. The practitioner still has to answer a few questions: how should the nodes in the network be connected? how many neuron-like processing units should be used? when should training be stopped to avoid overfitting? The data that goes into a neural network usually needs pre-processing: numeric data typically has to be normalized to the range 0.0 to 1.0, and categorical predictors have to be broken out into indicator predictors that take the value 1 or 0 for each original category (a small sketch of this follows below). There are no real shortcuts with neural networks. They are used in many applications and businesses, such as credit risk prediction and detection of fraudulent credit card use. They can also be used for prototype creation and clustering, where clusters are formed by algorithms that compress the data into prototypes and tilt the system toward clusters that overlap as little as possible, and they are used in outlier analysis and feature extraction.
Different types of neural networks: There are hundreds of variants of the back-propagation feed-forward neural network. Many of them change the architecture to include recurrent connections, in which the output of the output layer is fed back into the hidden layer as an input; such recurrent networks are used for sequence prediction and can also reduce the time taken to train the network. Back-propagation uses gradient descent to search for the best way to adjust the link weights so as to minimize the error.
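A minimal sketch of that pre-processing plus a small feed-forward network, using scikit-learn (the column names, toy records and network size are invented for illustration): numeric predictors are scaled into the 0.0 to 1.0 range, the categorical predictor is expanded into 0/1 indicator columns, and an MLP is trained on the result.

```python
# Sketch of the preprocessing described for neural networks: numeric predictors
# scaled to 0.0-1.0, categorical predictors expanded into 0/1 indicator
# columns, then a small feed-forward network is trained. All values are toys.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

# Toy records: [age, balance, region]; target: 1 = responded to an offer.
X = np.array([[25, 1200.0, "east"], [40, 300.0, "west"],
              [33, 8800.0, "east"], [51, 150.0, "north"]], dtype=object)
y = np.array([1, 0, 1, 0])

prep = ColumnTransformer([
    ("scale", MinMaxScaler(), [0, 1]),                        # numeric -> 0..1
    ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),  # categorical -> 0/1 columns
])

net = make_pipeline(prep, MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=6))
net.fit(X, y)
print(net.predict(np.array([[29, 5000.0, "east"]], dtype=object)))
```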
Training a large number of neural networks with different randomly weighted links and selecting the one with the lowest error rate is one workable learning procedure. Back-propagation is simple, easy to understand and works in a large number of domains. Other neural network architectures are Kohonen feature maps and radial basis function networks: Kohonen feature maps are used for unsupervised clustering and learning, while radial basis functions are used for supervised learning and represent a hybrid between a neural network and nearest neighbor.
Kohonen feature maps: Developed in the 1970s to simulate certain brain functions, they are now mostly used for clustering and unsupervised learning. They are feed-forward networks with no hidden layer, only an input layer and an output layer, and the nodes in the output layer compete among themselves to show the strongest activation for a given record ("winner takes all"). Each output node represents a cluster; a record falls into one and only one cluster, but the other clusters it might fit are shown next to the best-matching cluster.
Rule induction: Rule induction is the most common form of knowledge discovery in unsupervised learning systems, and it resembles what most people imagine when they think of data mining: mining the vast database for a rule that reveals something about it. All possible patterns are pulled from the data, and each pattern is annotated with an accuracy and a significance that indicate the strength of the rule and how often it applies. The rules extracted from the database are ordered and presented to the user by how often they are correct and how frequently they apply. Retrieving all possible patterns is rule induction's biggest strength and also its weakness, since it becomes difficult to wade through all of them, and rules of similar strength may conflict with one another.
Which technique and when? Deciding which data mining technique to use is difficult, and in practice the right technique is often found by trial and error. Customer data is constantly changing, which means a model built purely on the past may no longer predict the future adequately.
Case studies
How companies learn your secrets: Target. Every time you go shopping, you share intimate details about your consumption patterns with retailers, and many of those retailers are studying those details to figure out what you like, what you need, and which coupons are most likely to make you happy. Target, for example, has figured out how to data-mine its way into your womb, to figure out whether you have a baby on the way long before you need to start buying diapers. Charles Duhigg outlines in the New York Times how Target tries to hook parents-to-be at that crucial moment before they turn into rampant, and loyal, buyers of all things pastel, plastic and miniature. He talked to Target statistician Andrew Pole (before Target freaked out and cut off all communications) about the clues to a customer's impending bundle of joy. Target assigns every customer a Guest ID number, tied to their credit card, name or email address, that becomes a bucket storing a history of everything they've bought and any demographic information Target has collected from them or bought from other sources. Using that, Pole looked at historical buying data for all the women who had signed up for Target baby registries in the past. Target ran test after test, analyzing the data, and before long some useful patterns emerged.
Lotions, for example. Lots of people buy lotion, but one of Pole's colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date. As the statistician's computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a "pregnancy prediction" score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy. One Target employee Duhigg spoke to provided a hypothetical example: take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There's, say, an 87 percent chance that she's pregnant and that her delivery date is sometime in late August. And perhaps it's a boy, based on the color of that rug? So Target started sending coupons for baby items to customers according to their pregnancy scores. Duhigg shares an anecdote, so good that it sounds made up, that conveys how eerily accurate the targeting is. An angry man went into a Target outside Minneapolis, demanding to talk to a manager: "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?" The manager didn't have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man's daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. "I had a talk with my daughter," he said. "It turns out there's been some activities in my house I haven't been completely aware of. She's due in August. I owe you an apology." What Target discovered fairly quickly is that it creeped people out that the company knew about their pregnancies in advance. "If we send someone a catalog and say, 'Congratulations on your first child!' and they've never told us they're pregnant, that's going to make some people uncomfortable," Pole told Duhigg. "We are very conservative about compliance with all privacy laws. But even if you're following the law, you can do things where people get queasy." "Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We'd put an ad for a lawn mower next to diapers. We'd put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance. And we found out that as long as a pregnant woman thinks she hasn't been spied on, she'll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don't spook her, it works." So the Target philosophy towards expecting parents is similar to the first-date philosophy?
Even if you've fully stalked the person on Facebook and Google beforehand, pretend you know less than you do so as not to creep the person out. Duhigg suggests that Target's gangbusters revenue growth, from $44 billion in 2002, when Pole was hired, to $67 billion in 2010, is attributable to Pole's helping the retail giant corner the baby-on-board market, citing company president Gregg Steinhafel boasting to investors about the company's "heightened focus on items and categories that appeal to specific guest segments such as mom and baby." Every major retailer, from the U.S.P.S. to grocery chains to banks, has a "predictive analytics" department to understand consumer interests, tastes and shopping habits so that it can market to them efficiently. "We're living through a golden age of behavioral research. It's amazing how much we can figure out about how people think now," notes the New York Times piece. As the ability to analyze data has grown more fine-grained, understanding how daily habits influence decisions has become one of the most exciting topics in clinical research, even though most of us are barely aware that those patterns exist. A study from Duke University estimated that habits, rather than conscious decision-making, account for about 45 percent of the choices we make every day. This research is also transforming our understanding of how habits function across societies and organizations. Paul O'Neill overhauled a stumbling conglomerate, Alcoa, and turned it into a top performer in the Dow Jones by relentlessly attacking one habit, a specific approach to worker safety, which in turn caused a companywide transformation. The Obama campaign hired a habit specialist as its "chief scientist" to figure out how to trigger new voting patterns among different constituencies. There is a calculus for mastering our subconscious urges, and the exhaustive rendering of our conscious and unconscious patterns into data sets and algorithms has revolutionized what companies know about us and how well they can sell to us. The process by which the brain converts a sequence of actions into an automatic routine is called "chunking", and there are dozens of behavioral chunks we rely on every day; some are simple, while others are so complex that it is remarkable a habit could have emerged from them at all.
Scaling big data mining infrastructure: the Twitter experience. Schemas play a major part in helping data scientists understand petabyte-scale data stores, but they do not by themselves provide the big picture of the data available for developing insights. In addition, heterogeneous components must be integrated into production workflows, a process the Twitter engineers call plumbing. Twitter's analytics team has undergone tremendous changes in a few years in terms of size, number of users, complexity and range of use cases. Every day, around 100 TB of raw data is fed into Hadoop jobs, which collectively do everything from data cleaning and aggregation to report generation, spam detection, follower recommendation and more. Successful data mining is much more than what academics usually consider data mining to be: a good amount of infrastructure and tooling is needed to convert vague strategic directives into concrete, solvable problems. An analyst spends much of the time performing exploratory data analysis to understand what is there, which involves data munging and data cleaning. What academic researchers typically regard as data mining, converting domain insight into features and training models for different tasks, is only a small part of the overall insight-generation life cycle.
Three significant trends differentiate current insight-generation activities from the past. First, the amount of data has increased tremendously: apart from the data traditionally collected, we now also collect behavioral data from users, and the rise of social connections and user-generated content adds further to the volume being accumulated. Second, companies have become more sophisticated in the analyses they run on their large data stores. Much of this activity falls under OLAP (online analytical processing): common tasks include ETL from different data sources, creating joined views, filtering, aggregation and cube materialization, with the output feeding front-end dashboards, report generators and other visualization tools that support "drill down" and "roll up" operations on multidimensional data. Increasingly, data scientists are also interested in predictive analytics. Finally, open-source software plays an increasingly important role in today's ecosystem. A decade ago there was no credible open-source, enterprise-grade, distributed data analytics platform capable of handling large data volumes. Today the Hadoop open-source implementation of MapReduce lies at the center of a de facto platform for large-scale data analytics, surrounded by complementary systems such as HBase, ZooKeeper, Pig, Hive and many others, and its importance is validated not only by adoption in countless startups but also by the endorsement of industry heavyweights such as IBM, Microsoft, Oracle and EMC. Hadoop is not a panacea and is not an adequate solution for many problems, but a strong case can be made for Hadoop supplementing, and in some cases replacing, existing data management systems. Twitter's analytics sit at the intersection of these three developments: Twitter favored building on the open-source Hadoop platform over costly proprietary systems, and its analytics range from simple aggregations to training machine-learned models.
The big data mining cycle
Exploratory data analysis: Exploratory data analysis surfaces data quality issues. Data cleaning includes sanity checks; for example, for a service that records both per-component counts and aggregate counts, cleaning verifies that the sum of the component counts equals the aggregate count (a toy version of such a check is sketched below). Abrupt shifts in data characteristics are caught during sanity checking. The Twitter data ecosystem is so complex that no one person can tackle all the issues, so chasing down such shifts is common, time-consuming work that requires cooperation across teams. Even when the logs are correct, there can be many outliers caused by unanticipated use cases, which can often be pinned on non-human actors.
Data mining: After exploratory data analysis we can formulate the problem precisely as a data mining task and define what success means. The data scientist then gathers and tests data; at Twitter, for instance, we can look at data going back as many weeks as we wish and predict whether a user will still be active.
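Here is that toy version of the count sanity check (the log record, field names and numbers are invented for illustration): the cleaning step verifies that the per-component counts sum to the aggregate count before the data is used downstream, so an abrupt shift that breaks the invariant is flagged immediately.

```python
# A toy version of the count sanity check described above: a service logs both
# per-component counts and an aggregate count, and cleaning verifies that the
# components sum to the aggregate before the data is used downstream.
daily_log = {
    "aggregate_events": 10_500,
    "component_events": {"web": 6_200, "ios": 2_800, "android": 1_500},
}

def sanity_check(record, tolerance=0):
    total = sum(record["component_events"].values())
    delta = abs(total - record["aggregate_events"])
    if delta > tolerance:
        raise ValueError(f"component sum {total} != aggregate {record['aggregate_events']}")
    return True

print(sanity_check(daily_log))   # raises if an abrupt shift breaks the invariant
```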
The next step is machine learning and feature extraction. Tens of terabytes of log data are distilled into compact, sparse feature vectors, and from there a classification model is built. At Twitter this is achieved through Pig scripts, which are compiled into physical plans and executed as Hadoop jobs. Classifiers are refined iteratively using standard practices such as feature selection, cross-validation and tuning of model parameters. Once a classifier looks good, it is evaluated prospectively: it is applied to current data and its prediction accuracy is checked a few weeks later, which ensures that no information from the future leaked into the classifier. After the product is launched, data scientists improve the algorithms incrementally based on feedback from user behavior, whether through simple parameter tuning or through more involved experiments with different algorithms; most deployed solutions combine several techniques. At Twitter these refinements are driven by A/B testing, as in most other organizations, but supporting A/B testing in a big data mining setting requires additional tooling: identifying the user buckets, keeping track of treatment assignments, and threading the user token through all analytics processes so that results can be broken down by condition. Successful deployment of a data product then gives rise to new problems: predicting user activity does not by itself affect user growth, so we need to act on the classifier's output and measure the effectiveness of those actions. One big data mining problem thus leads to another, and the cycle continues.
In this production context two groups work together and complement each other: infrastructure engineers who build the tools and keep operations running, and data scientists who use those tools for data mining and insight generation. The roles may overlap or be distinct; at Twitter, data science and analytics infrastructure are two different but closely integrated groups. Activities such as machine learning and feature extraction are important, but they make up only a small part of the data mining cycle. Many stages precede them, such as formulating the problem, data cleaning and exploratory data analysis, and much follows the predictions, such as deploying the solution in production and maintaining it continuously. A data mining infrastructure should support all of these activities rather than just executing data mining algorithms, and understanding the big data mining cycle helps us solve real-world problems better. Complex models with many parameters are not easy to maintain as the data shifts; models that can be updated or incrementally refined are preferable to models that must be retrained from scratch every time.
Plumbing: Building and operating a production analytics platform is a big challenge because of the mismatches and overlaps among the various systems and frameworks involved. Choosing the right tool for the problem matters, since every system or framework is good at different things, and this has to be balanced against the cost of stitching the various components into integrated workflows. Different frameworks also offer different models of computation: MapReduce has the analyst break everything into maps and reduces, Pregel works in terms of vertex computations and message passing, and SQL and schemas constrain relational databases.
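To illustrate the MapReduce model just mentioned, here is a single-process toy in Python (a word count over a few invented records): the analyst writes only the map step and the reduce step, and a framework like Hadoop handles grouping the emitted keys and distributing the work across machines.

```python
# Sketch of the MapReduce idea: the computation is expressed as a map step
# (emit key/value pairs) and a reduce step (combine values per key). This
# single-process toy mimics what a distributed framework does at scale.
from collections import defaultdict

records = ["data mining at scale", "mining large data", "scale out the mining"]

def map_step(line):
    for word in line.split():
        yield word, 1                      # emit (key, value)

def reduce_step(key, values):
    return key, sum(values)                # combine all values for one key

grouped = defaultdict(list)                # the framework's shuffle/sort phase
for line in records:
    for key, value in map_step(line):
        grouped[key].append(value)

counts = dict(reduce_step(k, v) for k, v in grouped.items())
print(counts)
```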
Another challenge in integrating different systems and frameworks for big data mining is threading data flows across multiple interfaces. No job runs in isolation; each is part of some bigger workflow. There can be numerous dependencies: data may be generated by upstream processes or originate from external sources at data imports, and an analytics job feeds downstream processes until the data is finally presented in dashboards or deployed back out to user-facing services. All of this has to run like clockwork, with data imported at regular intervals and reports and dashboards refreshed frequently; otherwise we end up showing stale data.
Ad hoc data mining: The solution to a data mining problem has three main components: the raw data, the feature representation extracted from the data, and the model or algorithm used to solve the problem. Past experience indicates that the size of the data set is the most important of the three: simple models trained on huge data sets often perform better than sophisticated models trained on less data, and solutions to many problems are given by simple algorithms fed large amounts of data. For unsupervised data mining, more data is fed to the problem so that we can make sense of large data stores. Scaling up machine learning algorithms with multicore and cluster-based solutions is of much interest these days. Mahout is a popular toolkit for large-scale machine learning and data mining tasks, but most of the work on it focuses on scalable machine learning rather than on integration issues: some components of Mahout run only on a single machine while others scale to huge data sets via Hadoop, its processing consists of many monolithic multi-stage pipelines, and it needs data formatted in particular ways and presents results in custom formats, so integrating Mahout into an analytics stack requires adaptors to get data in and results out. Twitter's approach is instead to integrate machine learning components into Pig. A Pig script allows seamless reuse of the existing infrastructure for scheduling, data management and monitoring, as well as access to libraries of UDFs and to the materialized output of other scripts. This is achieved through two techniques: learning with stochastic gradient descent, and scaling out by partitioning the data (see the sketch after this section). Pig extensions have been developed that embed learners inside storage functions, the abstractions responsible for materializing tuples.
Challenges of big data management
Analytic architecture: It is not yet clear how an optimal architecture should deal with historical data and real-time data together. Nathan Marz's Lambda architecture is an interesting proposal: it decomposes the problem into three layers, a batch layer, a serving layer and a speed layer. Fault tolerance, robustness, generality, extensibility, scalability, support for ad hoc queries, debuggability and minimal maintenance are among the desired features of such a system (Fan & Bifet, 2013).
Statistical significance: We need statistically significant results rather than artifacts of randomness; with large data sets and thousands of questions it is easy to make such mistakes (Fan & Bifet, 2013).
Distributed mining: Many data mining techniques are not trivial to parallelize. Obtaining distributed versions of a model takes a lot of effort and research, both theoretical and practical, to come up with new methods (Fan & Bifet, 2013).
Time-evolving data: Data changes as time progresses, so data mining techniques must be able to adapt to this and detect the changes.
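Here is a minimal sketch of that scale-out pattern using scikit-learn's SGDClassifier on synthetic data (the data, the ten "shards" and the model choice are my own assumptions): the learner is fed one partition at a time via partial_fit, which is also the mechanism that lets a model be refined incrementally as new data arrives instead of being retrained from scratch.

```python
# Sketch of learning with stochastic gradient descent over partitioned data:
# a linear classifier is fed one data shard at a time via partial_fit. The
# same pattern supports incremental refinement as new data arrives.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=20, random_state=7)
partitions = np.array_split(np.arange(len(y)), 10)   # stand-in for 10 data shards

model = SGDClassifier(random_state=7)
classes = np.unique(y)
for part in partitions:                               # one pass per partition
    model.partial_fit(X[part], y[part], classes=classes)

print("training accuracy:", model.score(X, y))
```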
The field of data stream mining provides powerful techniques for exactly this kind of time-evolving data (Fan & Bifet, 2013).
Compression: The amount of space needed to store data matters. Two main approaches are compression, in which no data is lost, and sampling, in which some data is chosen as representative. Compression costs more time but less space; with sampling we lose information, but the space savings can be orders of magnitude. For example, coresets are small sets that approximate the original data, and merge-and-reduce can then use these small sets to solve hard machine learning problems in parallel (Fan & Bifet, 2013).
Visualization: Visualizing results is a big task in data mining. Because the data is so large, it is difficult to produce user-friendly visualizations, and new techniques are needed to tell and show stories with the data (Fan & Bifet, 2013).
Hidden big data: Large amounts of useful data are effectively lost because new data is unstructured, untagged and file-based. The 2012 IDC study on big data estimated that 23% of the digital universe would be useful for big data analysis if it were tagged and analyzed, but only 3% of that potentially useful data is tagged today (Fan & Bifet, 2013).
Conclusion: Big data keeps growing every year, and data scientists will have to manage much more of it, data that is larger, more diverse and faster-moving. Big data is the new final frontier for business applications and scientific research, and we are entering a new era of big data mining that will help us discover knowledge no one has seen before (Fan & Bifet, 2013).

Friday, August 17, 2012

Nokia Lumia 800 features, reviews and price
The Nokia Lumia 800 is a result of the tie-up between Microsoft and Nokia: it is the first Nokia handset running Windows Phone with the Metro interface. The handset comes packed in a small box that includes standard accessories such as a USB cable, manual, charger and earphones along with the phone. Contrary to the common opinion that Nokia doesn't give much importance to design, this phone has a sleek look; the right side has buttons for volume, lock/power and camera capture. The handset is expensive in comparison with other Windows handsets, and the design looks similar to the Nokia N9. The AMOLED screen has vibrant, bright colors and sharp pixel quality, though it is no better than some HTC handsets, and the 16 GB of internal storage is higher than comparable Samsung or HTC phones. The SIM slot and charging port sit on the very top of the phone, and the battery is not easily accessible: removing it is a little difficult and has to be done with care. The charge also doesn't last more than a day. The pros of the phone are the sleek, slim design, the non-reflective display and the quick, immediate response of the OS; the cons are the camera quality under dim light, the lack of a front-facing camera and a few Bluetooth connectivity issues. Overall, the handset is good and comes at a price of around 21K in India.

Friday, September 3, 2010

Android, then a small company located in Palo Alto, CA, was acquired by Google in July 2005. At the time not much was known about what Android did; people just knew it was making mobile phone software. Rumors then started that Google was planning to step into the cell phone market, though its role in that market was not clear.
There is a large group of developers writing applications, called apps, that extend the devices' functionality; more than 200,000 apps are now available for Android. The Android Market is an online application store run by Google, although apps can also be downloaded from third-party websites. Apps are written mostly in the Java language and control the device through Java libraries developed by Google.
The Android distribution was introduced on 5 November 2007 with the founding of the Open Handset Alliance, an association of around 80 software, telecom and hardware companies dedicated to advancing open standards for mobile phones. Google released most of the Android code under the Apache License, a free and open-source software license.
The open-source Android software stack consists of Java applications running on a Java-based, object-oriented application framework on top of Java core libraries, which in turn run on Dalvik, a virtual machine featuring just-in-time (JIT) compilation. Libraries written in C include the OpenCore media framework, the surface manager, the OpenGL ES 2.0 3D graphics API, the SQLite relational database management system, the SGL graphics engine, the WebKit layout engine, Bionic libc and SSL. The Android operating system uses the Linux kernel, and the platform comprises roughly 12 million lines of code, including about 3 million lines of XML, 2.1 million lines of Java, 2.8 million lines of C and 1.75 million lines of C++.
