Chapter 1: Data Acquisition, Data Structure, Data Types and Platforms for Data Analysis

Platforms for Data Analysis

Over the years, data scientists have created and used many platforms for their work. Two common free-to-use options are R and Python.


R is a high-level programming language created mainly for statistical analysis. It is useful for data exploration, analysis and visualization, and it is used most commonly in research, although it also has some commercial applications.


Python is another high-level programming language and one of the most widely used for data science. It is beginner-friendly and easy to pick up, which lets users get into Data Science and Machine Learning quickly, without a steep learning curve. It is also a general-purpose programming language: it supports object-oriented programming and is used in other areas such as web programming, so it has many other applications and is a good entry point for most people looking to learn programming, including for the purpose of data science.

For the purposes of explanation in this series of blog posts, Python will be the main programming language referenced when discussing the Data Science process.

Data Acquisition

At the foundation of every data science and machine learning project is the availability and abundance of data. Hence, the data science process typically begins with data acquisition to obtain information relevant to the problem one is trying to solve. There are three main ways to acquire relevant data: web mining or web scraping, online APIs, and data that is already available (usually collected and provided by a business or company).

Web Mining / Web Scraping

Many Python packages can be used for web mining or web scraping, but Selenium, Beautiful Soup and Urllib are typical choices. Urllib fetches pages over HTTP, Beautiful Soup parses the returned HTML so that information can be extracted and stored for analysis, and Selenium drives a real browser, which helps with pages that build their content using JavaScript. To understand and fully utilize these packages, pick up a basic understanding of the web technologies HTML, CSS and JavaScript before attempting to extract online information.
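As a minimal sketch of the parsing step, the example below uses Beautiful Soup on a small inline HTML string instead of a downloaded page. The HTML, the "customer" class name and the field names are made up for illustration; a real page needs its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Customers</h1>
  <ul>
    <li class="customer"><a href="/customers/1">Alice</a></li>
    <li class="customer"><a href="/customers/2">Bob</a></li>
  </ul>
</body></html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Collect one record per list item that carries the "customer" class.
records = []
for item in soup.find_all("li", class_="customer"):
    link = item.find("a")
    records.append({"name": link.get_text(strip=True), "url": link["href"]})

print(records)
```

In practice the html string would come from a request made with Urllib, or from Selenium when the page is rendered by JavaScript.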

APIs

API data can be requested in Python with Urllib, as mentioned earlier, provided that the URL for data access is structured so that it can be iterated through dynamically (for example, by changing a page number in the query string). In addition, the separate third-party package Requests simplifies the process: users perform HTTP requests for API data by passing the relevant API location to a package function. Postman, which is often mentioned alongside these, is a standalone tool for exploring and testing APIs rather than a Python package.
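The idea of iterating over a structured URL can be sketched with the standard library alone. The endpoint below, its page and per_page parameters, and the helper names build_url and fetch_page are all hypothetical; a real API documents its own URL scheme and may require authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/v1/customers"


def build_url(page, per_page=50):
    """Return the URL for one page of results."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{BASE_URL}?{query}"


def fetch_page(page):
    """Request one page and decode its JSON body (needs network access)."""
    with urlopen(build_url(page), timeout=10) as response:
        return json.load(response)


# The URLs that a loop over pages 1 to 3 would request.
urls = [build_url(page) for page in range(1, 4)]
print(urls)
```

Calling fetch_page(1) would perform the actual request. It is left uncalled here because the endpoint is not real.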

Data Structure and Data Types


Data Structure

The most commonly used data structure for data analysis in Python is the pandas DataFrame. When data is read into a DataFrame, it is stored in a rectangular format in which each row represents one data point (for example, one customer's profile) and each column represents a specific feature of that data point (for example, the customer's income level). Because each column holds a single feature, each column can also have its own data type (integers, floats, strings, etc.), as long as the column contains only one type of data.
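A small sketch of this layout, using made-up customer data (the column names and values are illustrative only):

```python
import pandas as pd

# Each row is one customer; each column is one feature.
customers = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Chen"],
        "age": [34, 45, 29],
        "income": [52000.0, 61500.5, 48250.0],
    }
)

# Every column carries its own data type.
print(customers.dtypes)

# A row is a single data point; a column is a single feature.
first_customer = customers.iloc[0]
incomes = customers["income"]
print(first_customer)
print(incomes)
```

Here age is stored as integers and income as floats, because each of those columns contains only one type of data.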

Data Types

In Data Science, there are a few common data types that are widely used and should be understood before we begin data processing and manipulation. These data types are integer, float, string, boolean, null, and object.

Integers are whole-number values, as opposed to floats, which are numbers with a decimal component.

Strings are sequences of characters, such as words, special characters and even digits.

Boolean values are either True or False and may also be represented as 1 or 0.

Null or NaN values represent empty or missing data in the dataset and should be treated as missing data during data cleaning later on.

Objects represent either a non-primitive data type (anything other than the types listed above) or a combination of the different data types mentioned earlier. A column that mixes types cannot be given one consistent data type, so it is stored as an object column instead. In pandas, text columns have traditionally been stored with the object type as well.
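The data types above can be seen in a small pandas sketch. The column names and values are made up for illustration, and the exact type names printed may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "visits": [3, 7, 1],               # integers
        "spend": [19.99, None, 5.0],       # floats, with one missing value
        "is_member": [True, False, True],  # booleans
        "mixed": [1, "two", 3.0],          # mixed types, stored as object
    }
)

# One data type per column.
dtypes = df.dtypes
print(dtypes)

# Count the missing (NaN) values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The missing entry in spend is stored as NaN, so the column remains a float column, while the mixed column falls back to the object type.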
