statsmodels provides datasets (i.e. data and metadata) for use in examples, tutorials, model testing, and so on.
The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. For example:
In [3]: import statsmodels.api as sm

In [4]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [5]: print duncan_prestige.__doc__

In [6]: duncan_prestige.data.head(5)

Note that get_rdataset downloads the data over the network, so an internet connection is required; without one the call fails with a URLError.
get_rdataset(dataname[, package, cache])   Download and return an R dataset.
get_data_home([data_home])                 Return the path of the statsmodels data dir.
clear_data_home([data_home])               Delete all the content of the data home cache.
Load a dataset:
In [7]: import statsmodels.api as sm
In [8]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the dataset proposal.
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [9]: data.endog[:5]
Out[9]: array([ 60323., 61122., 60171., 61187., 63221.])
In [10]: data.exog[:5,:]
Out[10]:
array([[ 83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
[ 88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
[ 88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
[ 89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
[ 96.2, 328975. , 2099. , 3099. , 112075. , 1951. ]])
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [11]: data.endog_name
Out[11]: 'TOTEMP'
In [12]: data.exog_name
Out[12]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If the dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes instead. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset, while the raw_data attribute contains a plain ndarray; the column names for both are given by the names attribute.
In [13]: type(data.data)
Out[13]: numpy.core.records.recarray
In [14]: type(data.raw_data)
Out[14]: numpy.ndarray
In [15]: data.names
Out[15]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data as pandas objects:
In [16]: data = sm.datasets.longley.load_pandas()
In [17]: data.exog
Out[17]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289 2356 1590 107608 1947
1 88.5 259426 2325 1456 108632 1948
2 88.2 258054 3682 1616 109773 1949
3 89.5 284599 3351 1650 110929 1950
4 96.2 328975 2099 3099 112075 1951
5 98.1 346999 1932 3594 113270 1952
6 99.0 365385 1870 3547 115094 1953
7 100.0 363112 3578 3350 116219 1954
8 101.2 397469 2904 3048 117388 1955
9 104.6 419180 2822 2857 118734 1956
10 108.4 442769 2936 2798 120445 1957
11 110.8 444546 4681 2637 121950 1958
12 112.6 482704 3813 2552 123366 1959
13 114.2 502601 3931 2514 125368 1960
14 115.7 518173 4806 2572 127852 1961
15 116.9 554894 4007 2827 130081 1962
[16 rows x 6 columns]
In [18]: data.endog
Out[18]:
0 60323
1 61122
2 60171
3 61187
4 63221
5 63639
6 64989
7 63761
8 66019
9 67857
10 68169
11 66513
12 68655
13 69564
14 69331
15 70551
Name: TOTEMP, dtype: float64
With pandas integration in the estimation classes, the metadata will be attached to model results:
In [19]: y, x = data.endog, data.exog
In [20]: res = sm.OLS(y, x).fit()
In [21]: res.params
Out[21]:
GNPDEFL -52.993570
GNP 0.071073
UNEMP -0.423466
ARMED -0.572569
POP -0.414204
YEAR 48.417866
dtype: float64
In [22]: res.summary()
Out[22]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: TOTEMP R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.052e+04
Date: Thu, 13 Mar 2014 Prob (F-statistic): 8.20e-22
Time: 23:59:47 Log-Likelihood: -117.56
No. Observations: 16 AIC: 247.1
Df Residuals: 10 BIC: 251.8
Df Model: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
GNPDEFL -52.9936 129.545 -0.409 0.691 -341.638 235.650
GNP 0.0711 0.030 2.356 0.040 0.004 0.138
UNEMP -0.4235 0.418 -1.014 0.335 -1.354 0.507
ARMED -0.5726 0.279 -2.052 0.067 -1.194 0.049
POP -0.4142 0.321 -1.289 0.226 -1.130 0.302
YEAR 48.4179 17.689 2.737 0.021 9.003 87.832
==============================================================================
Omnibus: 1.443 Durbin-Watson: 1.277
Prob(Omnibus): 0.486 Jarque-Bera (JB): 0.605
Skew: 0.476 Prob(JB): 0.739
Kurtosis: 3.031 Cond. No. 4.56e+05
==============================================================================
Warnings:
[1] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
If you want to know more about the dataset itself, you can access the following metadata attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']