This is my first draft.
I want my paper in LaTeX format for Overleaf. Besides addressing the feedback, I also want the paper to be more credible.
I have also attached the sources and the figures I made in R.
The paper content should be:
Title and keywords are well thought out and appropriate for the content.
The abstract is brief and informative. It makes sense both in isolation and as a summary
of the paper.
Introduction:
Addresses the three questions that need to be answered.
The background literature and study rationale are clearly articulated. The references of
current/related works are included.
The hypotheses follow logically from previous work.
There is a roadmap at the end of the section to help the reader navigate through the
remainder of the manuscript.
Data:
The data description is clear and properly sourced.
The descriptive and visual analyses are appropriate and meaningful.
Methods:
The statistical methods are appropriate for the research question.
The statistical notation is appropriate and clearly defined.
The assumptions of the statistical methods are clearly articulated.
The methods for checking any assumptions are clearly stated or described.
Results:
The statistical methods are conducted correctly.
The assumptions of the statistical methods are evaluated.
Violations of assumptions, if any, are handled appropriately.
The presentation of results is clear and accessible.
Discussion/Conclusion:
The discussion of the results links back to the substantive topic in the introduction.
The storyline is clear and backed up by the analyses.
The limitations of the study are acknowledged.
Future directions of the project are discussed.
Revision:
The paper has been revised per the peer, grader, and instructor comments.
I got the following feedback from the professor:
1. Your stated goal is to find which variables have an important effect on medical expenses and to make predictions, but you didn't do that in your paper.
2. Lines 165-179: This part needs to be reorganized. You introduce the simple linear regression starting at line 165 and need to state which equation is used for it between lines 168-170. But after introducing how you construct the simple linear regression, you go back to the transformation of your multiple linear regression.
3. Line 179: "there is outlier above 1, but this plot contain less outlier than the original plot". This sentence does not seem to belong here; please indicate which graph has outliers higher than 1 and which is the original graph.
4. Lines 180-185: "The parameters of the model are the intercept (?) and the slope (?)..." Which model? In which equation did you use alpha as the intercept?
5. Figure 5: Based on your formula lm(initial_year_cost ~ Cancer site), your response variable is initial_year_cost, which is not consistent with your equation for simple linear regression.
6. Line 208: The F test you state below is for overall significance.
7. Line 200 and Table 1: You mention the t statistic for each beta. What information can you get from these values? Is any variable significant?
8. Lines 217-220: You are talking about how to get the test statistic, so how? What is the formula? You only have the formulas for the standard error and the estimated regression coefficients.
9. Lines 243-249: You have already mentioned these coefficient values in the Methods part.
10. You did not check the assumptions for either the multiple or the simple linear regression.
11. You transformed charge in the Methods part. Are there any results from the transformed data?
12. The Methods and Results parts need to be rewritten so that they are organized, readable, and easy to understand. Some results in the Methods part should move to the Results section.
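For major comments 6 to 8, the standard test statistics for a multiple linear regression can be sketched as follows. This is only a sketch under the usual assumptions (independent, normally distributed errors with constant variance); the symbols n (number of observations), p (number of slope parameters), SSR (regression sum of squares), and SSE (error sum of squares) are notation assumed here and are not taken from the draft.

```latex
% Requires \usepackage{amsmath} for \text.
% t statistic for a single coefficient beta_j (assumed notation).
\[
  t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p-1}
  \quad \text{under } H_0\colon \beta_j = 0,
\]
% Overall F statistic for significance of the regression relation.
\[
  F = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)} \sim F_{p,\,n-p-1}
  \quad \text{under } H_0\colon \beta_1 = \dots = \beta_p = 0.
\]
```

The t statistic addresses whether one variable is significant given the others in the model, while the F statistic addresses the overall significance of the regression; neither is a test for equality of variance.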
Minor:
1. Change the captions of your figures: do not start with "This is"; write a clear, informative caption for each figure.
2. Line 135: beta_1 to beta_6; beta_0 is the intercept.
3. Lines 145-149: This is confusing. Where are the three horizontal lines?
4. Lines 152-159: You can write down your fitted regression equation. Same for lines 189-198.
5. Line 150: "...residuals close to 2000": I don't understand this.
6. Line 222, equation 6, numerator, beta_1, sum_{i=1}^n.
7. Lines 162-163: "The majority of forecasts were between $2,850 over and $1,400 under the true value since 50% of errors...": I don't understand this.
8. Line 173: "The hostogram, histogram, residual vs fitted, and normal Q-Q plots": label which figure you are referring to here.
9. Line 238: "The p-value is 2.2e-16...": where did you get this p-value? Is it for the F test you mentioned in the Methods part?
Instructor: I agree with all grader major and minor comments. Additionally:
Major:
1. Roadmap at the end of section 1 is missing
2. Page 6 line 133: did you literally combine the datasets into one? If yes, this would not be appropriate. A second dataset was not mentioned in your proposal so I am not sure why or how this data is being used.
3. Page 7 equation 1 and Page 8 equation 2: Region is a 4-level nominal categorical variable, so it cannot be modeled through a single parameter (beta_6), but would instead need to have three parameters associated with it (plus intercept). I question all interpretations and results regarding region if based on a single regression coefficient.
4. Methods and Results sections: as noted by the grader, better separation is needed between the Methods and Results sections, and better organization is needed within each of these sections.
Each result (which should appear in the Results section) needs a corresponding method stated or described in the Methods section. E.g., Figure 3 represents a result, so it should appear in the Results section; the description of the plot, however, should be in Section 3 (more on this in the comment below). Likewise, lines 152-164 should all go in the Results section.
Within each section, reorganize so that text pertaining to a single model is grouped together. Currently your text bounces between untransformed multiple linear regression, transformed multiple linear regression, and simple linear regression, and it is often not clear which model you are referring to when you report the results.
5. Page 7 lines 145-149: which plot are you referring to here? This description does not appear to correspond to Figure 3.
6. Page 9 lines 165-172, equation 3, lines 180-185: I am not sure how you are using simple linear regression here. Cancer site is a nominal categorical variable and therefore cannot be used as the dependent variable for simple linear regression. Figures 5 and 9 would indicate the reverse relationship, but this still should not be a simple linear regression since cancer_site is still a nominal categorical variable. (Figure 5 also suggests a transformation of initial charges is needed.) Everything pertaining to simple linear regression in this draft needs to be carefully scrutinized before submitting the final version.
7. Page 11 lines 208-210 and equation 4: as noted by the grader, equation 4 is an overall F test for significance of the regression relation in the multiple linear regression model; this is not an F test for equality of variance.
8. Pages 13-14, Figures 6-7: If the caption for Figure 6 is correct, this plot is not meaningful and should be replaced with something that is meaningful. Patient's condition is a nominal categorical variable; it should not be normally distributed. In the text below Figure 6, you correctly state that you check normality of the residuals, which is what you need to check. I am not sure what is being checked for normality in Figure 7, but you would need to check regression assumptions for all regression models you fit.
9. Pages 15-16 lines 257-262: as also noted by the grader, you have not discussed predictions at all in your Results section. If this is your emphasis, you need to more clearly examine this in the results section.
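For major comment 3, one common way to write the model with region as a 4-level categorical predictor is sketched below. The indicator names R_{i1}, R_{i2}, R_{i3}, the coefficients gamma_1 to gamma_3, and the choice of reference level are assumptions made for illustration; the five remaining predictors are kept generic as x_{i1} to x_{i5}.

```latex
% Requires \usepackage{amsmath} for \overset and \text.
% Region enters through three 0/1 indicators; the fourth level is the
% reference category absorbed into the intercept (assumed coding).
\[
  Y_i = \beta_0 + \sum_{j=1}^{5} \beta_j x_{ij}
        + \gamma_1 R_{i1} + \gamma_2 R_{i2} + \gamma_3 R_{i3} + \varepsilon_i,
  \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\]
% R_{ik} = 1 if observation i belongs to non-reference region k, else 0.
% Whether region matters overall is a joint hypothesis on all three terms:
\[
  H_0\colon \gamma_1 = \gamma_2 = \gamma_3 = 0 .
\]
```

Each gamma_k is then interpreted as the expected difference in the response between region k and the reference region, holding the other predictors fixed. In R, storing region as a factor makes lm() build these indicators automatically.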
Minor:
Page 2: alphabetize keywords
Page 2 line 43: Likely a missing citation at the question mark at the end. A citation would be needed for this paragraph, regardless.
Page 3 lines 48-50: citation needed
Page 3 line 53: rephrase: "According to Morid et al. (2019), chronic health conditions are long-term medical..."
Page 3 line 60: replace "looked up the" with "found that"
Page 3 lines 71-73: by "we", do you mean insurance companies?
Page 4 line 77: citation should be parenthetical (use \citep{}) and placed inside the preceding sentence (before the period)
Page 4 line 93: remove "test"
Page 4 lines 96-97 and page 5 line 110: rephrase: "Thus, we will likely need to transform charges for subsequent analyses."
Page 5 line 99: remove ", with features that reveal details about" (repeated text)
Page 5 line 101: - the patient contributes data voluntarily.
Page 6 line 115: citation needed
Page 7 equation 1 and page 9 equation 3: The equation needs to be incorporated as part of a sentence; it cannot stand alone. See also grader comments about beta_0 and alpha, and my major comments regarding region and the simple linear regression model.
Page 7 line 142: replace "variation" with "difference"
Page 9 line 179: remove
Page 12 Table 1: a column of numerical information is missing: you should report the coefficient estimates, their standard errors, and the t-value.
Page 12: remove equation (5) through line 225; these formulas are not needed or relevant.
General 1: Use cross-referencing to refer to each specific Figure and Table within the text.
General 2: all displaymath equations need to be punctuated.
General 3: only label (number) equations that you refer to elsewhere in the text.
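The citation and general comments above can be illustrated with a minimal Overleaf-ready sketch. The citation key morid2019, the labels eq:mlr and fig:resid-fitted, the caption wording, and the placeholder bibliography entry are all hypothetical; the real paper would use its own keys, figures, and full references.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{natbib} % provides \citep (parenthetical) and \citet (textual)

\begin{document}

% Parenthetical citation placed inside the sentence, before the period.
A sentence that needs a source ends like this \citep{morid2019}.

% Equation incorporated into a sentence, punctuated, and numbered
% because it is referred to again later.
The multiple linear regression model is
\begin{equation}
  Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
  \label{eq:mlr}
\end{equation}
where $\beta_0$ is the intercept. Model~\eqref{eq:mlr} is fitted first.

% Equation that is never referenced: punctuated but left unnumbered.
The residuals are
\[
  e_i = Y_i - \hat{Y}_i .
\]

\begin{figure}[ht]
  \centering
  \rule{4cm}{3cm} % placeholder box standing in for the R figure
  \caption{Residuals versus fitted values for the multiple linear
           regression model.}
  \label{fig:resid-fitted}
\end{figure}

% Cross-reference the figure by label instead of typing its number.
The residual plot is shown in Figure~\ref{fig:resid-fitted}.

\begin{thebibliography}{1}
\bibitem[Morid et~al.(2019)]{morid2019}
Placeholder entry; replace with the full reference.
\end{thebibliography}

\end{document}
```

Using \label with \ref or \eqref keeps figure, table, and equation numbers correct when sections are reorganized, which matters here because several figures are expected to move from Methods to Results.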