In my previous posts, I used the read_stata() method to read Stata datasets into pandas data frames. This works well when you want to read an entire Stata dataset into Python. But sometimes we wish to read a subset of the variables or observations, or both, from a Stata dataset into Python. In this post, I’ll introduce you to the Stata Function Interface (SFI) module and show you how to use it to read partial datasets into a pandas data frame.
If you’re not familiar with Python, it may be helpful to read the first four posts in my Stata/Python Integration series before you read further.
- Setting up Stata to use Python
- Three ways to use Python in Stata
- How to install Python packages
- How to use Python packages
Using the SFI module to move data from Stata to Python
The SFI is a Python module that allows you to pass information back and forth between Stata and Python. You can copy whole or partial datasets, data frames, local and global macros, scalars and matrices, and even global Mata matrices. There are far too many features to show you in one blog post. So today I’m going to show you a feature that you are likely to use: reading partial Stata datasets into Python. We’ll explore more SFI features in future posts.
Let’s begin the code block below by using the auto dataset. Next, let’s enter the Python environment and import the Data class from the SFI module. Then, we will use the get() method in the Data class to copy the variable foreign into a Python list object named dataraw. The first argument of the get() method is a quoted string that lists one or more Stata variables.
sysuse auto
python
from sfi import Data
dataraw = Data.get('foreign')
dataraw
end
The Python output shows us that the list object dataraw contains the data for the Stata variable foreign.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw = Data.get('foreign')
>>> dataraw
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
> , 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> end
--------------------------------------------------------------------------------
Specify a range of observations
The second argument of the get() method allows us to specify a range of observations. I’ve used the range() function in the code block below to request ten observations. Because range(46,56) produces the zero-based indices 46 through 55, it selects Stata observations 47 through 56. Note that I’ve also added mpg and rep78 to the list of variables.
python
from sfi import Data
dataraw = Data.get('foreign mpg rep78',
                   range(46,56))
dataraw
end
The Python output shows the contents of the list object dataraw. The list contains sublists that each include three values. Each sublist is an observation from the Stata dataset and contains data for the variables foreign, mpg, and rep78. The number 8.98846567431158e+307 in the fifth observation is a missing value, and we will learn how to handle it below.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56))
>>> dataraw
[[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307],
> [0, 24, 2], [1, 17, 5], [1, 23, 3], [1, 25, 4], [1, 23, 4]]
>>> end
--------------------------------------------------------------------------------
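Because range() excludes its stop value, it can help to check which observations a given range() selects before passing it to get(). The following is a minimal sketch in plain Python that runs without Stata. It assumes, consistent with the output above, that the SFI counts observations from zero; the name stata_obs is my own and is not part of the SFI.

```python
# Indices generated by range(46, 56): the stop value 56 is excluded.
indices = list(range(46, 56))

# Assumption: SFI observation indices are zero-based, so adding 1 gives
# the observation numbers that Stata's list command would display.
stata_obs = [i + 1 for i in indices]

print(len(indices))                 # number of observations requested
print(indices[0], indices[-1])      # first and last zero-based index
print(stata_obs[0], stata_obs[-1])  # first and last Stata observation
```

Ten indices are produced, which matches the ten sublists in dataraw.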
Specify observations using an indicator variable
The third argument of the get() method allows us to further restrict our data based on an indicator variable. In the example below, I’ve generated a new variable named touse that equals 1 if mpg is less than 20 and 0 otherwise. Then, I’ve specified “touse” as the third argument in get().
generate touse = mpg<20
python
from sfi import Data
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse")
dataraw
end
The Python output shows that dataraw contains only observations where mpg is less than 20.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse")
>>> dataraw
[[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, 8.98846567431158e+307],
> [1, 17, 5]]
>>> end
--------------------------------------------------------------------------------
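Conceptually, the indicator variable acts like a row filter applied on top of the observation range. The sketch below mimics that behavior in plain Python using the ten rows returned earlier. It only illustrates the idea and is not how the SFI is implemented; the names rows, touse, and selected are hypothetical.

```python
# Rows copied from the earlier output: [foreign, mpg, rep78].
MISSING = 8.98846567431158e+307  # Stata missing value as shown above
rows = [[0, 18, 4], [0, 18, 1], [0, 19, 3], [0, 19, 3], [0, 19, MISSING],
        [0, 24, 2], [1, 17, 5], [1, 23, 3], [1, 25, 4], [1, 23, 4]]

# The indicator equals 1 when mpg < 20 and 0 otherwise.
touse = [int(mpg < 20) for _, mpg, _ in rows]

# Keep only the rows whose indicator is nonzero.
selected = [row for row, keep in zip(rows, touse) if keep]

print(len(selected))
```

Six rows survive the filter, the same six observations that get() returned with "touse" as its third argument.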
Get value labels rather than numbers
The values of the Stata variable foreign are labeled with “Domestic” for 0 and “Foreign” for 1.
. list foreign in 50/54
     +----------+
     |  foreign |
     |----------|
 50. | Domestic |
 51. | Domestic |
 52. | Domestic |
 53. |  Foreign |
 54. |  Foreign |
     +----------+
Our Python list object dataraw stores only the underlying numeric values 0 and 1, but we may prefer to work with the labels. The fourth argument of get() allows us to pass the value labels of a Stata variable to Python rather than the numbers. I’ve specified valuelabel=True in the code block below to pass the value labels to Python.
python
from sfi import Data
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse",
                   valuelabel=True)
dataraw
end
The Python output below shows us that dataraw now contains the words “Domestic” and “Foreign”. Note that these are strings rather than labeled numeric values.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse",
...                    valuelabel=True)
>>> dataraw
[['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19,
> 3], ['Domestic', 19, 8.98846567431158e+307], ['Foreign', 17, 5]]
>>> end
--------------------------------------------------------------------------------
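To see why the distinction between strings and labeled numbers matters, here is a small sketch in plain Python. The dictionary origin_labels is a hypothetical stand-in for the value label attached to foreign; it is not something the SFI returns.

```python
# Hypothetical mapping that mirrors the value label attached to foreign.
origin_labels = {0: "Domestic", 1: "Foreign"}

# Two rows in the numeric form returned without valuelabel=True.
rows = [[0, 18, 4], [1, 17, 5]]

# Replace the numeric code in the first position with its label text.
labeled = [[origin_labels[code], mpg, rep78] for code, mpg, rep78 in rows]

print(labeled)
print(type(labeled[0][0]).__name__)  # the label is an ordinary Python str
```

Once the labels replace the codes, a comparison such as foreign == 1 no longer matches anything, so code that filters on the numeric values needs to compare against the strings instead.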
Specify a value for missing values
The fifth argument of get() allows us to specify a value for missing data. Recall that Stata stores missing values as the largest possible value for a numeric storage type. The Stata variable rep78 is stored as a double-precision numeric variable that has a maximum value of 8.98846567431158e+307. Floating-point numeric variables have a maximum value of 1.70141173319e+38, long variables have a maximum value of 2,147,483,620, int variables have a maximum value of 32,740, and byte variables have a maximum value of 100. Thus, the precise value of a missing value depends on the storage type of the variable.
Python does not recognize these numbers as missing values. Python interprets 8.98846567431158e+307 as a number. Missing numeric values in Python are often represented with NumPy’s special floating-point value “nan”, which was first defined by the Institute of Electrical and Electronics Engineers in the IEEE 754-1985 standard. We can tell Python that 8.98846567431158e+307 is “not a number” (nan) by specifying missingval=np.nan for the fifth argument of get().
python
from sfi import Data
import numpy as np
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse",
                   valuelabel=True,
                   missingval=np.nan)
dataraw
end
The Python output below shows that the number 8.98846567431158e+307 in dataraw has been replaced with NumPy’s special floating-point value “nan”.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse",
...                    valuelabel=True,
...                    missingval=np.nan)
>>> dataraw
[['Domestic', 18, 4], ['Domestic', 18, 1], ['Domestic', 19, 3], ['Domestic', 19,
> 3], ['Domestic', 19, nan], ['Foreign', 17, 5]]
>>> end
--------------------------------------------------------------------------------
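The difference between the Stata missing value and nan is easy to check in plain Python. The sketch below uses only the standard library and the numbers shown in the output above. The manual replacement is an illustration of the idea and is not a description of what missingval does internally.

```python
import math

stata_missing = 8.98846567431158e+307  # value shown in the earlier output
rep78 = [4, 1, 3, 3, stata_missing, 5]

# Python treats the Stata missing value as an ordinary, very large float.
print(math.isnan(stata_missing))

# Illustration only: swap the sentinel for nan by hand.
cleaned = [math.nan if value == stata_missing else value for value in rep78]

# nan never compares equal to itself, so count it with math.isnan().
n_missing = sum(math.isnan(value) for value in cleaned)
print(n_missing)
```

Left unconverted, the sentinel would be included in sums and means as a huge number, which is why converting it before analysis matters.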
Convert the list object to a pandas data frame
We’ve used get() to copy part of our Stata dataset into a Python list object named dataraw. Next, let’s convert our list object to a pandas data frame.
We begin by importing pandas using the alias pd. Then, we can create a data frame by typing dataframe = pd.DataFrame(dataraw).
python
from sfi import Data
import numpy as np
import pandas as pd
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse",
                   valuelabel=True,
                   missingval=np.nan)
dataframe = pd.DataFrame(dataraw)
dataframe
end
The Python output below displays the data frame dataframe. The columns labeled 0, 1, and 2 are the variables foreign, mpg, and rep78, respectively. The unlabeled column on the left is an index that pandas created to uniquely identify each row.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> import pandas as pd
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse",
...                    valuelabel=True,
...                    missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw)
>>> dataframe
          0   1    2
0  Domestic  18    4
1  Domestic  18    1
2  Domestic  19    3
3  Domestic  19    3
4  Domestic  19  NaN
5   Foreign  17    5
>>> end
--------------------------------------------------------------------------------
Label the columns of a data frame
We can label the columns of our data frame using the columns option in the DataFrame() method. The list of column names must be enclosed in square brackets, and each column name must be enclosed in single quotes and separated by commas.
python
from sfi import Data
import numpy as np
import pandas as pd
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse",
                   valuelabel=True,
                   missingval=np.nan)
dataframe = pd.DataFrame(dataraw,
                         columns=['foreign', 'mpg', 'rep78'])
dataframe
end
The Python output below shows that the columns in the data frame are now named foreign, mpg, and rep78.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import numpy as np
>>> import pandas as pd
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse",
...                    valuelabel=True,
...                    missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw,
...                          columns=['foreign', 'mpg', 'rep78'])
>>> dataframe
    foreign  mpg  rep78
0  Domestic   18      4
1  Domestic   18      1
2  Domestic   19      3
3  Domestic   19      3
4  Domestic   19    NaN
5   Foreign   17      5
>>> end
--------------------------------------------------------------------------------
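Naming the columns pays off as soon as we start selecting data. The sketch below builds a small data frame from a hand-typed list so that it runs outside Stata. The three rows are copied from the output above, with float("nan") standing in for the missing rep78 value; it assumes pandas is installed.

```python
import pandas as pd

# Three rows copied by hand from the output above.
rows = [["Domestic", 18, 4], ["Domestic", 19, float("nan")], ["Foreign", 17, 5]]
df = pd.DataFrame(rows, columns=["foreign", "mpg", "rep78"])

# Named columns can be referenced directly in filters and summaries.
low_mpg = df[df["mpg"] < 18]["foreign"].tolist()
n_missing = int(df["rep78"].isna().sum())

print(list(df.columns))
print(low_mpg)
print(n_missing)
```

Without the columns option, the same filter would have to refer to the columns as 0, 1, and 2, which is harder to read and easier to get wrong.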
Begin the data frame index at 1
Python uses zero-based array indexing, which means that row and column counts begin with 0 rather than 1. So pandas automatically created a row index that begins with 0. You can skip to the next section if you are comfortable with the index beginning at zero or you don’t plan to use the index. Or you can change the index to begin with 1 using the index option in the DataFrame() method.
We will specify the index using the arange() function in the NumPy module. The first argument is the first element of the row index, which is 1. The second argument is the stop value, which arange() excludes, so it must be one more than the last element of the row index. We could simply type 7 because there are 6 rows in our data frame. But this number could change the next time we run our code. Instead, we can use the len() function to calculate the length of the list object dataraw and add 1 to it.
python
from sfi import Data
import numpy as np
import pandas as pd
dataraw = Data.get('foreign mpg rep78',
                   range(46,56),
                   "touse",
                   valuelabel=True,
                   missingval=np.nan)
dataframe = pd.DataFrame(dataraw,
                         columns=['foreign', 'mpg', 'rep78'],
                         index=[np.arange(1, len(dataraw)+1)])
dataframe
end
The Python output below shows us that the index for dataframe now begins at 1 and ends with 6.
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import pandas as pd
>>> import numpy as np
>>> dataraw = Data.get('foreign mpg rep78',
...                    range(46,56),
...                    "touse",
...                    valuelabel=True,
...                    missingval=np.nan)
>>> dataframe = pd.DataFrame(dataraw,
...                          columns=['foreign', 'mpg', 'rep78'],
...                          index=[np.arange(1, len(dataraw)+1)])
>>> dataframe
    foreign  mpg  rep78
1  Domestic   18      4
2  Domestic   18      1
3  Domestic   19      3
4  Domestic   19      3
5  Domestic   19    NaN
6   Foreign   17      5
>>> end
--------------------------------------------------------------------------------
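The arithmetic behind the index is easy to verify outside Stata. The sketch below uses a short hand-typed list in place of dataraw and assumes NumPy and pandas are installed. One difference from the code above is my own choice: I pass the array to index directly rather than wrapping it in square brackets, because, as far as I know, pandas treats a list containing a single array as the levels of a MultiIndex. That displays the same way but can behave differently in lookups.

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the list returned by get().
rows = [["Domestic", 18], ["Domestic", 19], ["Foreign", 17]]

# arange(1, n + 1) yields 1, 2, ..., n because the stop value is excluded.
index = np.arange(1, len(rows) + 1)
df = pd.DataFrame(rows, columns=["foreign", "mpg"], index=index)

print(df.index.tolist())
print(df.loc[1, "foreign"])  # label-based lookup now starts at 1
```

Because the stop value is computed from len(rows), the index stays correct when the number of selected observations changes.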
Using getAsDict()
You can also use getAsDict() to copy Stata data to a Python dictionary. The arguments are the same as get(), and the resulting dictionary contains the names of the Stata variables. This means that we don’t have to name the columns when we convert the dictionary to a data frame. Creating a data-frame index that begins with 1 is different because the length of a dictionary is the number of variables, not the number of Stata observations. In the code block below, I defined obs as the length of the first list of values in the dictionary dataraw, plus 1. I used the iter() and next() functions to retrieve that first list without looping over the whole dictionary. And I again added 1 because arange() excludes its stop value.
python
from sfi import Data
import pandas as pd
import numpy as np
dataraw = Data.getAsDict('foreign mpg rep78',
                         range(46,56),
                         "touse",
                         valuelabel=True,
                         missingval=np.nan)
dataraw
obs = len(next(iter(dataraw.values()))) + 1
dataframe = pd.DataFrame(dataraw,
                         index=[np.arange(1, obs)])
dataframe
end
The Python output below shows that the resulting data frame looks much like the data frame we created using get().
. python
----------------------------------------------- python (type end to exit) ------
>>> from sfi import Data
>>> import pandas as pd
>>> import numpy as np
>>> dataraw = Data.getAsDict('foreign mpg rep78',
...                          range(46,56),
...                          "touse",
...                          valuelabel=True,
...                          missingval=np.nan)
>>> dataraw
{'foreign': ['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Foreig
> n'], 'mpg': [18, 18, 19, 19, 19, 17], 'rep78': [4, 1, 3, 3, nan, 5]}
>>> obs = len(next(iter(dataraw.values()))) + 1
>>> dataframe = pd.DataFrame(dataraw,
...                          index=[np.arange(1, obs)])
>>> dataframe
    foreign  mpg  rep78
1  Domestic   18    4.0
2  Domestic   18    1.0
3  Domestic   19    3.0
4  Domestic   19    3.0
5  Domestic   19    NaN
6   Foreign   17    5.0
>>> end
--------------------------------------------------------------------------------
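The expression len(next(iter(dataraw.values()))) is compact, so it may help to see it unpacked. The sketch below uses a hand-typed dictionary with the same shape as the output above and runs in plain Python. The intermediate name first_column is my own.

```python
# Hand-typed dictionary with the same layout as the getAsDict() output above.
dataraw = {
    "foreign": ["Domestic", "Domestic", "Domestic", "Domestic", "Domestic", "Foreign"],
    "mpg": [18, 18, 19, 19, 19, 17],
    "rep78": [4, 1, 3, 3, float("nan"), 5],
}

# len() of a dictionary counts its keys, that is, the variables.
n_vars = len(dataraw)

# iter() creates an iterator over the value lists and next() takes the
# first one, so no explicit loop is needed.
first_column = next(iter(dataraw.values()))
n_obs = len(first_column)

print(n_vars, n_obs)
```

Every value list has the same length, so any one of them gives the number of observations.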
Just the basics
Perhaps you don’t wish to restrict your sample, and you don’t mind zero-based indexing. You just want to copy a collection of variables to a pandas data frame in Python. The code block below will do that and convert Stata missing values to Python missing values.
python
from sfi import Data
import pandas as pd
import numpy as np
dataraw = Data.getAsDict('foreign mpg rep78',
                         None,
                         None,
                         valuelabel=False,
                         missingval=np.nan)
dataframe = pd.DataFrame(dataraw)
dataframe
end
The output below displays the data frame dataframe, which is ready for graphing or data analysis.
>>> dataframe
    foreign  mpg  rep78
0         0   22    3.0
1         0   17    3.0
2         0   22    NaN
3         0   20    3.0
4         0   15    4.0
..      ...  ...    ...
69        1   23    4.0
70        1   41    5.0
71        1   25    4.0
72        1   25    4.0
73        1   17    5.0

[74 rows x 3 columns]
Conclusion
We did it! We used the get() and getAsDict() methods in the Data class of the SFI module to copy part of a Stata dataset to a Python data frame. We even accounted for missing data. And it’s easy to use get() and getAsDict() in our do-files, ado-files, and Python scripts anytime we want to incorporate Python into our data management, analysis, or reporting. Next time, I’ll show you how to use the SFI to copy data from Python into a Stata dataset.
Further reading
Di Russo, J. 2019. Navigating The Hell of NaNs in Python, Towards Data Science (blog), October 23, 2019.
Fung, K. 2020. Array Indexing: 0-based or 1-based?, Analytics Vidhya, January 26, 2020.
