An Analysis of Broadband Access: American Community Survey Data Assignment Sample

Question Synopsis

The assignment is to complete a data analysis project in R and document it in RMarkdown. Students select the topic, formulate hypotheses, describe the methods (data source, sample, measures, analysis), report results (descriptive statistics, hypothesis tests), and discuss the findings. The project may require data cleaning in R, relevant statistical tests, figures, and tables. The final product is a manuscript in HTML format, properly formatted using RMarkdown. Grading criteria include how the code executes, the decisions made during the analysis, and whether the final presentation is well composed and flows logically. Students also record a 5-10 minute PowerPoint presentation summarizing the project, with one slide per report section.

Subject Name : Management

Answer Synopsis

The response is an R Markdown document that opens with an Introduction, followed by Hypotheses, Methods, Results, and Discussion sections, and includes the R code for data pre-processing, analysis, and visualization. The Results section presents tables, charts, and statistical test results, each with an explanation. According to Grant, Jenkins and Kunkel, the Discussion situates the results and addresses the limitations. The presentation is a PowerPoint summary of the project with one brief slide for each main section and a final slide with the more detailed findings. This approach shows the student’s capacity to design, execute, interpret, and present a complete research study using both quantitative and presentation skills.

Introduction

The link between broadband internet connection and housing characteristics in an era where digital connectivity is as crucial as utilities presents a mixed picture of modern American life.
Broadband access and US housing typologies are examined in this research. Using ACS PUMS data, we examine the association between internet infrastructure, housing type, number of bedrooms, and building structure in various localities.
The research seeks to define the “digital divide” and discuss equitable technology access, a key indicator of socioeconomic growth. This project aims to improve understanding of infrastructure gaps and support digital inclusion.

Hypothesis

Our working hypothesis for studying internet availability and US housing characteristics is:

We expect a strong association between broadband internet access and building type, bedroom count, and housing unit type. Broadband connectivity will be better in locations with more modern multi-bedroom houses. This association is due to socioeconomic factors that affect both technical infrastructure and housing quality. Broadband access will also vary by socioeconomic status and across the urban-rural divide. This hypothesis allows a statistical analysis of association in the dataset.

Methods

Our research employs statistical analysis to examine the association between broadband internet access and home features throughout the US using data from the American Community Survey Public Use Microdata Sample (ACS PUMS). The initial step of data cleaning and processing removes observations with missing values to ensure analysis integrity. After that, we use descriptive statistics to identify the dataset’s central tendency, dispersion, and distribution shape.
Our analytical method centers on hypothesis testing. The chi-square test of independence determines whether categorical variables, such as geographical location and internet access, are associated. This non-parametric test was chosen because it assesses relationships without requiring normally distributed data.
In addition to the categorical analysis, we use a t-test to compare the weighting variable ‘WGTP’ across broadband access levels. This test shows whether the mean weight differs between the broadband categories.
We use several graphics to explain and illustrate the data. Histograms, density plots, and boxplots depict the weighting variable’s distribution over broadband categories and regions. A faceted histogram examines the weight distribution of each broadband category separately. A correlation matrix heatmap shows the relationships between all numerical variables in our dataset.
Every analytical step—from inferential testing to descriptive statistics—is carefully chosen to ensure we answer our research questions and analyze the data rigorously and clearly. Besides being analytical tools, visualizations help us communicate our findings to academics and regional economic and digital infrastructure policy stakeholders.
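The cleaning step described in the Methods can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the data frame `pums`, its values, and the choice of analysis variables are assumptions made for the example.

```r
# Toy stand-in for a PUMS housing extract; all values are invented for illustration.
pums <- data.frame(
  REGION   = c(4, 4, 4, 4, 4),
  BROADBND = c(1, 2, NA, 1, 1),
  WGTP     = c(50, NA, 30, 120, 75)
)

# Hypothetical choice: only these variables must be complete for the analysis.
analysis_vars <- c("REGION", "BROADBND", "WGTP")

# Keep rows with no missing value in any analysis variable.
pums_clean <- pums[complete.cases(pums[, analysis_vars]), ]

# Report how many observations were removed, so the loss is documented.
n_dropped <- nrow(pums) - nrow(pums_clean)
cat("Rows kept:", nrow(pums_clean), " Rows dropped:", n_dropped, "\n")
```

Restricting `complete.cases()` to the analysis variables avoids discarding records that are missing only in columns the analysis never uses.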

Descriptive Analysis

The output from the summarise() function in R indicates an attempt to calculate descriptive statistics across all variables in the dataset, including measures such as the mean, standard deviation (sd), median, and interquartile range (IQR). However, warnings were generated because some variables are not numerical (e.g., ‘RT’, ‘SERIALNO’), and therefore, functions like mean and sd cannot be applied. The NA values in the output suggest that the mean and standard deviation could not be computed for these non-numeric variables. To resolve this, one should only apply these summary functions to appropriate numeric variables. The provided statistics for numeric variables, like ‘WGTP’, indicate the central tendency and dispersion, which are crucial for understanding the dataset’s distribution.
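One way to avoid these warnings is to select the numeric columns before summarising. The sketch below uses base R on an invented toy data frame; the column ‘BDSP’ (bedroom count) and all values are assumptions for illustration. With dplyr, `summarise(across(where(is.numeric), ...))` would serve the same purpose.

```r
# Toy stand-in for a PUMS housing extract; the values are invented for illustration.
pums <- data.frame(
  RT       = c("H", "H", "H", "H", "H"),
  SERIALNO = c("2019HU01", "2019HU02", "2019HU03", "2019HU04", "2019HU05"),
  WGTP     = c(12, 45, 88, 130, 310),
  BDSP     = c(1, 2, 3, 3, 4),
  stringsAsFactors = FALSE
)

# Keep only numeric columns so mean(), sd(), median(), and IQR() are all defined.
numeric_cols <- pums[, sapply(pums, is.numeric), drop = FALSE]

# Build a matrix with one row per variable and one column per statistic.
desc_stats <- t(sapply(numeric_cols, function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    IQR    = IQR(x, na.rm = TRUE))
}))

print(desc_stats)
```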

Contingency table

The output from the table() function in R indicates that a contingency table was created for two categorical variables: ‘REGION’ and ‘BROADBND’. In this table, ‘4’ represents the region code and ‘1’ and ‘2’ represent the categories for broadband access. The numbers 20475 and 1513 are the counts of observations for each category within the region ‘4’.
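The table described above can be reproduced from the two reported counts. The sketch below rebuilds a data frame with exactly those counts (the data frame itself is a reconstruction, not the original extract) and adds row proportions, which make the imbalance between the categories easier to read.

```r
# Rebuild the reported counts as a data frame (region 4 only, as in the output).
pums <- data.frame(
  REGION   = rep(4, 20475 + 1513),
  BROADBND = rep(c(1, 2), times = c(20475, 1513))
)

# Cross-tabulate region against broadband category.
tab <- table(REGION = pums$REGION, BROADBND = pums$BROADBND)
print(tab)

# Row proportions: the share of each broadband category within the region.
print(round(prop.table(tab, margin = 1), 3))
```

Note that the table has a single row, because only one region code is present in the data.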

Chi-squared test

The chi-square test of independence that follows this step will evaluate whether there is a statistically significant association between the region and the availability of broadband. If the p-value from the chi-square test is less than the chosen significance level (commonly 0.05), it would suggest that the distribution of broadband categories is not independent of the region, implying a possible relationship between the geographic region and broadband access.
The output of the chisq.test() function in R reports a chi-squared statistic of approximately 16352 with 1 degree of freedom and a p-value less than 2.2e-16 (which is virtually zero).
A statistic this large with such a small p-value shows that the two broadband categories are far from equally frequent, but it needs careful interpretation. Because the contingency table contains only region ‘4’, it has a single row, and chisq.test() treats a one-row table as a goodness-of-fit test against equal category proportions rather than as a test of independence. The reported statistic matches that reading: with 20475 and 1513 observations and an expected count of 10994 in each category, the statistic is about 16352. We therefore reject the hypothesis that the two broadband categories are equally common in region ‘4’. Testing whether broadband access depends on region would require observations from at least two regions.
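The reported statistic can be checked directly from the two counts in the contingency table. The sketch below shows what chisq.test() does with a one-row table and, for contrast, with a two-row table. The second region's counts are entirely hypothetical and serve only to show the shape of an actual independence test.

```r
# One-row table built from the counts reported for region 4.
one_region <- matrix(c(20475, 1513), nrow = 1,
                     dimnames = list(REGION = "4", BROADBND = c("1", "2")))

# With a single row, chisq.test() runs a goodness-of-fit test
# against equal proportions for the two categories.
gof <- chisq.test(one_region)
print(gof$method)
print(unname(gof$statistic))

# The same statistic computed by hand: sum((observed - expected)^2 / expected).
observed <- c(20475, 1513)
expected <- rep(sum(observed) / 2, 2)
print(sum((observed - expected)^2 / expected))

# HYPOTHETICAL second region (counts invented) to illustrate an independence test.
two_regions <- rbind(one_region, "3" = c(15000, 2500))
indep <- chisq.test(two_regions)
print(indep$method)
```

Only the second call tests whether the distribution of broadband categories differs between regions.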

Histogram

The histogram shows the frequency distribution of the variable ‘WGTP’, which is heavily right-skewed. Most records lie towards the lower end of the ‘WGTP’ scale, with fewer observations as the value of ‘WGTP’ increases. The tail extending to the right indicates outliers: a small number of records with much higher weights than the rest. Since ‘WGTP’ is the housing-unit weight, a higher value means that the record stands for more housing units in the population, not that the response is more important. The distribution therefore suggests that most sampled records represent relatively few housing units while a few represent many, possibly reflecting uneven sampling rates across demographics or geographic areas. The skewness of the data may require transformations for certain types of statistical analysis.
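A right-skewed weight distribution and the effect of a log transformation can be illustrated with simulated data. The weights below are drawn from a log-normal distribution purely as an assumption for the example; they are not the survey's actual ‘WGTP’ values.

```r
# Simulated right-skewed weights (assumption: log-normal), not real PUMS data.
set.seed(42)
wgtp <- round(rlnorm(2000, meanlog = 4, sdlog = 0.8))

# Histogram on the original scale: long tail to the right.
hist(wgtp, breaks = 50, main = "Simulated WGTP", xlab = "WGTP")

# In a right-skewed distribution the mean exceeds the median.
print(c(mean = mean(wgtp), median = median(wgtp)))

# A log transformation pulls in the tail; log1p() also handles zeros safely.
hist(log1p(wgtp), breaks = 50, main = "Simulated WGTP (log scale)",
     xlab = "log(1 + WGTP)")
```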

Boxplot

The boxplot illustrates the distribution of the ‘WGTP’ variable for region 4. The box, representing the interquartile range (IQR), shows that the middle 50% of the data is tightly clustered, indicating little variability within the central portion of the dataset. The line within the box denotes the median, which sits towards the lower end of the range, suggesting a skew in the data. Several points lie above the upper whisker, which extends well beyond the top of the box, indicating outliers with ‘WGTP’ values much higher than the typical range. The absence of points below the lower whisker means there are no extreme low-value outliers. Overall, the distribution is skewed, with a few records carrying much higher weights. Because a higher weight means a record represents more housing units, these records likely come from groups sampled at a lower rate in the survey’s design.

Scatter Plot

The scatter plot compares ‘WGTP’ (weighting) and ‘BROADBND’ (broadband access), likely coded as 1 for access and 2 for no access. Data points are clustered at the 1 and 2 positions on the x-axis, which correspond to the binary nature of the ‘BROADBND’ variable. Most ‘WGTP’ values are concentrated at the lower end for both broadband categories, but there are several outliers indicating higher weights. The presence of outliers is particularly pronounced for the ‘no access’ category. This visualization suggests that while there is a general pattern of lower weights across the dataset, there are exceptional cases with significantly higher weights. The plot does not immediately suggest a clear relationship between broadband access and the assigned weights; the variation within each category of broadband access appears similar.

Correlation

The heatmap depicts the correlation matrix of various variables, with the color intensity indicating the strength and direction of the correlation—red for positive, blue for negative. The darker the shade, the stronger the correlation. Diagonal red squares show a perfect positive correlation, as they represent the relationship of each variable with itself.
The off-diagonal squares reveal how each pair of variables is related. Notably, some pairs exhibit significant positive correlations (darker red), while others have lesser or no correlation (white or light-colored squares). The presence of any blue squares would suggest negative correlations, but they seem to be absent or very light, indicating weak negative relationships if present.
Interpreting this heatmap requires a closer look at specific variable pairs to understand their relationships. For instance, variables that are closely related to each other might be used in regression analysis to predict one from the other. Conversely, variables with little to no correlation might be considered independent factors in the context of the study.
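A correlation heatmap of this kind can be sketched in base R. The data are simulated, and the variables ‘BDSP’ (bedrooms) and ‘NP’ (persons) are assumptions chosen for illustration; the original analysis may have used other columns and a plotting package.

```r
# Simulated numeric variables; values are invented for illustration.
set.seed(7)
n    <- 300
BDSP <- sample(1:5, n, replace = TRUE)
NP   <- pmax(1, BDSP + sample(-1:2, n, replace = TRUE))  # built to correlate with BDSP
WGTP <- round(rlnorm(n, meanlog = 4, sdlog = 0.8))
num  <- data.frame(WGTP, BDSP, NP)

# Pairwise Pearson correlations between all numeric columns.
cm <- cor(num, use = "pairwise.complete.obs")
print(round(cm, 2))

# Blue-white-red palette with fixed limits so that white corresponds to zero.
pal <- colorRampPalette(c("blue", "white", "red"))(21)
k   <- ncol(cm)
image(1:k, 1:k, cm, zlim = c(-1, 1), col = pal,
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:k, labels = colnames(cm))
axis(2, at = 1:k, labels = rownames(cm))
```

Fixing `zlim` to the range from -1 to 1 keeps the colour scale symmetric, so weak correlations appear pale regardless of the values in a particular dataset.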

Density plot

The distribution is not symmetrical and is heavily weighted towards the lower end of the ‘WGTP’ scale. This pattern is typical of variables where a large proportion of the data clusters around a low range of values, but with some instances of much higher values. The presence of such a skew may influence statistical analyses and could be indicative of underlying disparities in the sample, such as certain groups being over- or under-represented in the dataset.

Faceted histogram

The faceted histogram presents the distribution of ‘WGTP’ (weighting) across two categories of broadband access, labeled ‘1’ and ‘2’, and a third category ‘NA’ for missing data. Both categories ‘1’ and ‘2’ show a right-skewed distribution with most of the data concentrated at the lower end of the weighting scale and a long tail stretching towards higher values. This indicates that within each category, most records have a low weight, with a few exceptional records carrying much higher weights.
The presence of a third ‘NA’ facet suggests there are records in the data with missing information regarding broadband access. This facet’s single bar indicates a concentration of records with a specific ‘WGTP’ value, which may need further investigation to understand why this particular weighting is common among records with missing broadband data.
Comparing the first two facets, the pattern of distribution appears similar, suggesting that the relationship between ‘WGTP’ and broadband access might not differ greatly between the two categories. However, the actual counts and the range of ‘WGTP’ values could provide more context on the prevalence and weighting of broadband access within the surveyed population.

Box plot

The boxplot visualizes the distribution of ‘WGTP’ (weighting) across different categories of broadband access, labeled ‘1’, ‘2’, and ‘NA’ for missing data. For categories ‘1’ and ‘2’, the boxes, which represent the interquartile range (IQR), are relatively narrow, indicating that most of the data within these categories are clustered around the lower end of the scale. This is consistent with a positive skew, as seen in the histograms and density plots.
Notably, the median, indicated by the line within each box, is closer to the bottom of the box, which is consistent with the skewness. There are numerous outliers for both categories, shown by the points beyond the whiskers, particularly in the ‘2’ category, indicating that some records have significantly higher weights.
The ‘NA’ category shows a similar distribution to the ‘2’ category but with fewer outliers, suggesting that where broadband access data is missing, the distribution of weights is less extreme.
These boxplots can be useful for identifying differences in the spread and central tendency of ‘WGTP’ across the categories of broadband access. The similarity in the spread of ‘WGTP’ between the broadband access categories could imply that weighting is not strongly dependent on broadband access status. However, the outliers suggest there are exceptional cases in each category that may require further investigation.
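A grouped boxplot that keeps missing broadband values as their own category can be sketched as follows. The data are simulated, and the weights are generated independently of broadband status, so the example shows the technique rather than the report's findings.

```r
# Simulated data; WGTP is drawn independently of BROADBND (an assumption).
set.seed(1)
pums <- data.frame(
  BROADBND = sample(c(1, 2, NA), 600, replace = TRUE, prob = c(0.8, 0.15, 0.05)),
  WGTP     = round(rlnorm(600, meanlog = 4, sdlog = 0.8))
)

# Recode missing values to an explicit "NA" label so they get their own box.
pums$access <- factor(
  ifelse(is.na(pums$BROADBND), "NA", as.character(pums$BROADBND)),
  levels = c("1", "2", "NA")
)

boxplot(WGTP ~ access, data = pums,
        xlab = "BROADBND category", ylab = "WGTP")

# Count the points beyond the whiskers in each category.
outliers <- tapply(pums$WGTP, pums$access,
                   function(x) length(boxplot.stats(x)$out))
print(outliers)
```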

Average Comparison

The prop.table output indicates the proportion of ‘BROADBND’ (broadband access) across different ‘PUMA’ (Public Use Microdata Area) codes. The values are proportions within each PUMA, with the majority showing a higher proportion of category ‘1’ (likely representing access to broadband). For example, in PUMA 100, approximately 88.99% have broadband access (‘1’) compared to 11.01% without (‘2’).
The summarize function output demonstrates that the average ‘WGTP’ (weight) differs across the ‘BROADBND’ categories. Those with broadband access (‘1’) have a higher average weight (98.6) compared to those without (‘2’), who have an average weight of 87.6. The ‘NA’ category has the lowest average weight of 68.9, indicating that responses with missing broadband data tend to have a lower weight.
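Both summaries can be sketched in base R on a small invented data frame. The PUMA codes and all values below are toy data for illustration and do not reproduce the figures reported above.

```r
# Toy data; all values are invented for illustration.
pums <- data.frame(
  PUMA     = c(100, 100, 100, 100, 200, 200, 200, 200),
  BROADBND = c(1, 1, 1, 2, 1, 1, 2, NA),
  WGTP     = c(100, 90, 110, 80, 120, 100, 60, 70)
)

# Proportion of each broadband category within each PUMA (rows sum to 1).
# table() drops the missing BROADBND value by default.
tab   <- table(PUMA = pums$PUMA, BROADBND = pums$BROADBND)
props <- prop.table(tab, margin = 1)
print(round(props, 3))

# Mean weight per broadband category, keeping missing values as a group.
grp   <- ifelse(is.na(pums$BROADBND), "NA", as.character(pums$BROADBND))
means <- tapply(pums$WGTP, grp, mean)
print(means)
```

Setting `margin = 1` is what makes the proportions add up within each PUMA instead of across the whole table.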

Chi-square result

The chi-square test result, with a chi-squared value of 16352 and a p-value less than 2.2e-16, is highly significant. Because the data cover only region ‘4’, however, the test compares the counts of the two ‘BROADBND’ categories against equal proportions rather than testing independence between ‘BROADBND’ and ‘REGION’. It shows that records with broadband access far outnumber those without in this region. Whether broadband access varies across regions, for example because of urbanization, economic development, or policies affecting infrastructure deployment, would need to be tested on data from several regions.

T-Test to compare means

The Welch two-sample t-test compares the average weights (‘WGTP’) between two groups defined by broadband access (‘BROADBND’), coded as ‘1’ and ‘2’. The t-test result shows a t-value of 5.9079, indicating a significant difference between the two groups. With degrees of freedom approximately 1806.3 and an extremely small p-value (4.132e-09), we reject the null hypothesis that the two groups have equal means. The 95% confidence interval (7.348838 to 14.652908) for the difference in means does not contain zero, further confirming the statistical significance. In practical terms, there is a significant difference in the weighting given to responses from areas with broadband access (mean of 98.64991) versus those without (mean of 87.64904).
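A Welch two-sample t-test of this form can be run with t.test(). The sketch below uses simulated weights with group sizes and parameters chosen arbitrarily, so its output will not match the values reported above.

```r
# Simulated data; group sizes and distribution parameters are assumptions.
set.seed(123)
pums <- data.frame(
  BROADBND = factor(rep(c("1", "2"), times = c(400, 60))),
  WGTP     = c(rlnorm(400, meanlog = 4.3, sdlog = 0.7),
               rlnorm(60,  meanlog = 4.1, sdlog = 0.9))
)

# var.equal = FALSE requests the Welch test, which does not assume
# equal variances in the two groups (this is also the default).
res <- t.test(WGTP ~ BROADBND, data = pums, var.equal = FALSE)
print(res)

# Components used in the write-up: t statistic, Welch degrees of freedom,
# p-value, and the 95% confidence interval for the difference in means.
print(unname(res$statistic))
print(unname(res$parameter))
print(res$p.value)
print(res$conf.int)
```

The Welch variant suits groups of very different size and spread, which is the usual situation when one broadband category is much larger than the other.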

Summary

The analyses conducted on the ACS PUMS dataset reveal significant findings related to broadband access across various regions and its associated weights (‘WGTP’). The prop.table results display a clear majority having broadband access within most PUMA codes. However, there is variability, with some areas showing nearly 97% access, while others are closer to 85%. The average ‘WGTP’ across broadband categories indicates that those with access (‘1’) have a higher average weight compared to those without (‘2’), suggesting potential disparities in the representation or characteristics of these areas.
The Welch two-sample t-test strengthens these observations by statistically confirming that the average weights are significantly different between the two broadband groups. With a p-value far below the conventional threshold for significance, the data provides strong evidence that broadband access is associated with the weight values assigned within the survey, possibly reflecting the survey’s emphasis on adequately representing areas based on their broadband connectivity.
The chi-square test shows that the two broadband categories are very unequally represented in region ‘4’, with a statistic and p-value that cannot be attributed to chance. Because only one region is present in the data, the test does not by itself establish regional disparities in infrastructure; comparing several regions would be needed for that. Combined with the variation in access across PUMA codes, these insights could still inform policy directions, indicating a need for targeted efforts to bridge the digital divide.