Write the code to read the raw training data into the data structure in the first approach described in the section called “The Raw Data”. That is, the data structure is a data frame with a column for each MAC address that detected a signal. For the column name, use the last two characters of the MAC address, or some other unique identifier.
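One possible starting point for the wide layout is sketched below with base R's *reshape()*. The toy data frame **long**, its column names, and the signal values are all made up for illustration; only the two MAC addresses come from this chapter. The real raw file must first be parsed into a long form like this before the reshaping step applies.

```r
# Toy long-format data: one row per (time, position, MAC) reading.
# All values are invented for illustration.
long <- data.frame(
  time   = c(1, 1, 2, 2, 3),
  posX   = c(0, 0, 1, 1, 2),
  posY   = c(0, 0, 0, 0, 1),
  mac    = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd",
             "00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd",
             "00:0f:a3:39:e1:c0"),
  signal = c(-53, -66, -55, -64, -60),
  stringsAsFactors = FALSE
)

# Short identifier: the last two characters of the MAC address.
long$id <- substring(long$mac, nchar(long$mac) - 1)
stopifnot(length(unique(long$id)) == length(unique(long$mac)))  # ids must stay unique

# One row per (time, posX, posY), one signal column per device.
wide <- reshape(long[, c("time", "posX", "posY", "id", "signal")],
                idvar = c("time", "posX", "posY"),
                timevar = "id", direction = "wide")

dim(long)
dim(wide)
```

A device that was not detected at a given time and position shows up as `NA` in its column, which is where the memory cost of the wide layout comes from when many devices are rarely detected.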

Compare the size of two data structures: the data frame created in the section called “The Raw Data” and the data frame created in the previous problem. Which uses less memory? What is the dimension of each? How might this change with different numbers of devices in the building? With different numbers of signals from the less commonly detected devices? Use *object.size()* and *dim()* to address these questions.

Compare the time taken to read the raw data and create the data frame using the two approaches described in the section called “The Raw Data”. Do this for subsets of the data of different sizes (chosen at random) and draw a curve of the time taken to read the data against the input size. Comment on the memory and speed of the two approaches. Use *system.time()* and *Rprof()* to make these comparisons.

Examine the **time** variable in the offline data. The characteristics of the signal may change over time because of, e.g., reduced battery power in the measuring device as time goes by, or because measurements taken on different days may have been made by different people with different levels of accuracy. Also, examination of **time** can give insight into how the experiment was carried out. Were positions close to each other measured at similar times? Do you see any change in the signal strength variability or mean over time? Try controlling for other variables that might affect this relationship.

Write the *readData()* function described in the section called “Creating a Function to Prepare the Data”. The arguments to this function are the file name, *filename*, and the MAC addresses to retain, *subMacs*. Determine whether these parameters should have default values or not. The return value is the data frame described in the section called “Cleaning the Data and Building a Representation for Analysis”. Use the *findGlobals()* function available in *codetools* to check that the function is not relying on any global variables.

In the section called “Distribution of Signal Strength” we calculated measures of center and location for the signal strengths at each combination of location, angle, and access point. (See, for example, Figure 1.9, “Comparison of Mean and Median Signal Strength”.) Another possible summary statistic we can calculate is the Kolmogorov-Smirnov test statistic for normality. If the signal strengths are roughly normal, then we expect the $p$-values to have a uniform distribution. This leads to about 5% of the $p$-values for the 8000 tests to fall below 0.05.
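The claim about the Kolmogorov-Smirnov p-values in the last exercise above can be checked on simulated data before turning to the real signals. The sketch below is only an illustration: the mean, standard deviation, and number of tests are made-up values, and the samples are drawn from a normal distribution with known parameters. With the offline data the mean and standard deviation have to be estimated from each sample, which tends to make the standard p-values too large, so the 5% figure should be read as a rough guide.

```r
set.seed(1)
nTests <- 2000   # illustrative; the exercise refers to 8000 tests
nObs   <- 110    # replications per combination, as stated in this section

# Simulate signal-like values that really are normal (parameters invented)
# and record the p-value of a KS test against that same normal distribution.
pvals <- replicate(nTests, {
  x <- rnorm(nObs, mean = -60, sd = 3)
  ks.test(x, "pnorm", mean = -60, sd = 3)$p.value
})

# Proportion of p-values below 0.05; close to 0.05 if they are uniform.
propBelow <- mean(pvals < 0.05)
propBelow

# Counts in ten equal-width bins; roughly equal counts suggest uniformity.
table(cut(pvals, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE))
```

Calling `hist(pvals)` gives the same picture graphically. For the real data, the same loop could run over each combination of location, angle, and access point.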

Write the *surfaceSS()* function that creates plots such as those in Figure 1.10, “Median Signal at Two Access Points and Two Angles”. This function has three arguments: *data* for the offline summary data frame, and *mac* and *angle*, which supply, respectively, the MAC address and angle to select the subset of the data that we want smoothed and plotted.

Consider the scatter plots in Figure 1.11, “Signal Strength vs. Distance to Access Point”. There appears to be curvature in the signal strength – distance relationship. Does a log transformation improve this relationship, i.e., make it linear? Note that the signals are negative values so we need to be careful if we want to take the log of signal strength.
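A minimal sketch of how the log-transformation question above might be explored follows. It uses simulated data, not the offline measurements: the variable names, the coefficients, and the assumption that signal strength falls off with the logarithm of distance are all invented for illustration. Two options are shown: transforming distance, which sidesteps the sign problem, and taking the log of the negated signal.

```r
set.seed(2)
n <- 500
distance <- runif(n, min = 1, max = 30)                      # made-up distances
signal   <- -40 - 20 * log10(distance) + rnorm(n, sd = 2)    # simulated, always negative here

# Straight-line fit on the original scale.
fitLinear  <- lm(signal ~ distance)

# Option 1: log-transform distance; no sign issue arises.
fitLogDist <- lm(signal ~ log10(distance))

# Option 2: log of the negated signal; valid only while every signal is below zero.
stopifnot(all(signal < 0))
fitLogSig  <- lm(log(-signal) ~ distance)

# Compare how much variation each straight-line fit explains.
r2 <- c(linear  = summary(fitLinear)$r.squared,
        logDist = summary(fitLogDist)$r.squared,
        logSig  = summary(fitLogSig)$r.squared)
round(r2, 3)
```

Residual plots, e.g. `plot(fitted(fitLogDist), resid(fitLogDist))`, are usually more informative than the R-squared values alone when judging whether the curvature has gone.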

The floor plan for the building (see Figure 1.1, “Floor Plan of the Test Environment”) shows six access points. However, the data contain seven access points with roughly the expected number of signals (166 locations × 8 orientations × 110 replications = 146,080 measurements). Using the signal strengths seen in the heat maps of Figure 1.10, “Median Signal at Two Access Points and Two Angles”, we matched the access points to the corresponding MAC addresses. However, two of the MAC addresses seem to be for the same access point. In the section called “Exploring MAC Address” we decided to keep the measurements from the MAC address `00:0f:a3:39:e1:c0` and to eliminate the `00:0f:a3:39:dd:cd` address. Conduct a more thorough data analysis into these two MAC addresses. Did we make the correct decision? Does swapping out the one we kept for the one we discarded improve the prediction?

Write the *selectTrain()* function described in the section called “Choice of Orientation”. This function has three parameters: *angleNewObs*, the angle of the new observation; *signals*, the training data, i.e., data in the format of **offlineSummary**; and *m*, the number of angles to include from **signals**. The function returns a data frame that matches **trainSS**, i.e., *selectTrain()* calls *reshapeSS()* (see the section called “Choice of Orientation” for this function definition).

We use Euclidean distance to find the distance between the signal strength vectors. However, Euclidean distance is not robust in that it is sensitive to outliers. Consider other metrics such as the $L_1$ distance, i.e., the sum of the absolute values of the differences. Modify the *findNN()* function in the section called “Finding the Nearest Neighbors” to use this alternative distance. Does it improve the predictions?

To predict location, we use the $k$ nearest neighbors to a set of signal strengths. We average the known $(x, y)$ values for these neighbors. However, a better predictor might be a weighted average, where the weights are inversely proportional to the "distance" (in signal strength) from the test observation. This allows us to include the $k$ points that are close, but to differentiate between them by how close they actually are. The weights might be $\frac{1/d_i}{\sum_{j=1}^{k} 1/d_j}$ for the $i$-th closest neighboring observation, where $d_i$ is the distance from our new test point to this reference point (in signal strength space). Implement this alternative prediction method. Does this improve the predictions? Use *calcError()* to compare this approach to the simple average.

In the section called “Cross-Validation and Choice of $k$” we used cross-validation to choose $k$, the number of neighbors. Another parameter to choose is the number of angles at which the signal strength was measured. Use cross-validation to select this value. You might also consider selecting the pair of parameters, i.e., $k$ and the number of angles, simultaneously.
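The weighted-average prediction and the alternative absolute-value distance from the exercises above can be sketched together in one small function. This is an illustration under stated assumptions, not the chapter's own code: *predWeighted()* is a hypothetical helper, the training data are passed as plain matrices with made-up values, and a small floor on the distances is added to avoid dividing by zero when a training point matches the new observation exactly.

```r
# Hypothetical helper: predict (x, y) for one new signal vector from the
# k nearest training rows, weighting neighbor i by (1/d_i) / sum_j (1/d_j).
predWeighted <- function(newSignal, trainSignals, trainXY, k = 3,
                         metric = c("euclidean", "manhattan")) {
  metric <- match.arg(metric)
  diffs <- sweep(trainSignals, 2, newSignal)       # subtract newSignal from every row
  dists <- if (metric == "euclidean") {
    sqrt(rowSums(diffs^2))
  } else {
    rowSums(abs(diffs))                            # sum of absolute differences
  }
  nearest <- order(dists)[seq_len(k)]              # indices of the k closest rows
  d <- pmax(dists[nearest], 1e-8)                  # guard against zero distance
  w <- (1 / d) / sum(1 / d)                        # weights sum to one
  colSums(trainXY[nearest, , drop = FALSE] * w)    # weighted average of positions
}

# Made-up training data: two access points, four reference positions.
trainSignals <- rbind(c(-50, -60), c(-52, -61), c(-70, -40), c(-80, -45))
trainXY      <- cbind(x = c(0, 1, 10, 12), y = c(0, 0, 5, 6))
newSignal    <- c(-51, -60)

predL1 <- predWeighted(newSignal, trainSignals, trainXY, k = 2, metric = "manhattan")
predL1
```

Here the two nearest rows are at absolute-value distances 1 and 2, so they receive weights 2/3 and 1/3 and the predicted position is pulled toward the closer one, whereas the simple average would sit halfway between them.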

The researchers who collected these data implemented a Bayesian approach to predicting location from signal strength. Their work is described in a paper available from http://www.informatik.uni-mannheim.de/pi4/publications/King2006g.pdf. Consider implementing this approach to building a statistical IPS.

Other statistical techniques have been developed to predict indoor positions from wireless local area networks. These include [bib:Krishnan, bib:Madigan and bib:Youssef]. Consider employing one of their approaches to building and testing a statistical IPS with the CRAWDAD data.