openGPMP
Open Source Mathematics Package
gpmp::core::DataTable Class Reference

#include <datatable.hpp>

Public Member Functions

 DataTable ()

DataTableStr csv_read (std::string filename, std::vector< std::string > columns={})
 Reads a CSV file and returns a DataTableStr. All data is parsed and stored as strings.

void csv_write ()
 Write DataTable to a CSV file.

DataTableStr tsv_read (std::string filename, std::vector< std::string > columns={})
 Reads a TSV file and returns a DataTableStr. All data is parsed and stored as strings.

DataTableStr json_read (std::string filename, std::vector< std::string > objs={})
 Reads a JSON file and returns a DataTableStr. All data is parsed and stored as strings.

DataTableStr datetime (std::string column_name, bool extract_year=true, bool extract_month=true, bool extract_time=false)
 Extracts date and time components from a timestamp column.

void sort (const std::vector< std::string > &sort_columns, bool ascending=true)
 Sorts the rows of the DataTable based on specified columns.

std::vector< DataTableStr > group_by (std::vector< std::string > group_by_columns)
 Groups the data by specified columns.

DataTableStr first (const std::vector< gpmp::core::DataTableStr > &groups) const
 Gets the first element of each created group.

void describe ()
 Prints some information about the DataTable.

DataTableInt str_to_int (DataTableStr src)
 Converts a DataTableStr to a DataTableInt.

DataTableDouble str_to_double (DataTableStr src)
 Converts a DataTableStr to a DataTableDouble.

template<typename T >
void display (std::pair< std::vector< T >, std::vector< std::vector< T >>> data, bool display_all=false)
 Displays a DataTable of type T, showing either all rows or only a subset.

void display (bool display_all=false)
 Overload of display() that displays what is currently stored in the DataTable object.
 
 DataTable ()
 DataTable constructor. Initializes column & row storage.

void printData ()

TableType csv_read_new (std::string filename, std::vector< std::string > columns={})
 Reads a CSV file and returns a DataTableStr. All data is parsed and stored as strings.

DataTableStr csv_read (std::string filename, std::vector< std::string > columns)

void csv_write ()
 Write DataTable to a CSV file.

DataTableStr tsv_read (std::string filename, std::vector< std::string > columns={})
 Reads a TSV file and returns a DataTableStr. All data is parsed and stored as strings.

DataTableStr json_read (std::string filename, std::vector< std::string > objs={})
 Reads a JSON file and returns a DataTableStr. All data is parsed and stored as strings.

void drop (std::vector< std::string > column_name)
 Drop specified rows from a DataTable.

DataTableStr datetime (std::string column_name, bool extract_year=true, bool extract_month=true, bool extract_time=false)
 Extracts date and time components from a timestamp column.

void sort (const std::vector< std::string > &sort_columns, bool ascending=true)
 Sorts the rows of the DataTable based on specified columns.

std::vector< DataTableStr > group_by (std::vector< std::string > group_by_columns)
 Groups the data by specified columns.

DataTableStr first (const std::vector< gpmp::core::DataTableStr > &groups) const
 Gets the first element of each created group.

void describe ()
 Displays some information about the DataTable.

void info ()
 Displays data types and null values for each column.

TableType native_type (const std::vector< std::string > &skip_columns={})
 Converts each DataTable column to its native type. The existing read/load methods produce only DataTableStr (all-string) tables, so this method converts those string values to their native types.

DataType inferType (const std::vector< std::string > &column)

DataTableInt str_to_int (DataTableStr src)
 Converts a DataTableStr to a DataTableInt.

DataTableDouble str_to_double (DataTableStr src)
 Converts a DataTableStr to a DataTableDouble.

void display (const TableType &data, bool display_all=false)
 Displays a DataTable, showing either all rows or only a subset.

void display (bool display_all=false)
 Overload of display() that displays what is currently stored in the DataTable object.
 

Private Attributes

std::vector< std::string > headers_
 
std::vector< std::vector< std::string > > rows_
 
std::vector< std::string > new_headers_
 
std::vector< std::vector< std::string > > data_
 
DataTableStr original_data_
 
MixedType rows_
 
MixedType data_
 

Detailed Description

Examples
linreg.cpp.

Definition at line 76 of file datatable.hpp.

Constructor & Destructor Documentation

◆ DataTable() [1/2]

gpmp::core::DataTable::DataTable ( )
inline

Definition at line 91 of file datatable.hpp.

{
    // Initialize data_ and headers_ to empty vectors
    data_ = std::vector<std::vector<std::string>>();
    headers_ = std::vector<std::string>();
}

References data_, and headers_.

◆ DataTable() [2/2]

gpmp::core::DataTable::DataTable ( )
inline

DataTable constructor. Initializes column & row storage.

Definition at line 111 of file datatable_wip.hpp.

{
    // Initialize data_ and headers_ to empty vectors
    headers_ = std::vector<std::string>();
    data_ = MixedType();
}
std::vector< std::vector< std::variant< int64_t, long double, std::string > > > MixedType

References data_, and headers_.

Member Function Documentation

◆ csv_read() [1/2]

DataTableStr gpmp::core::DataTable::csv_read ( std::string  filename,
std::vector< std::string >  columns 
)

◆ csv_read() [2/2]

gpmp::core::TableType gpmp::core::DataTable::csv_read ( std::string  filename,
std::vector< std::string >  columns = {} 
)

Reads a CSV file and returns a DataTableStr. All data is parsed and stored as strings.

Parameters
filename: the path to the CSV file
columns: optional vector of column names to read in; if empty, all columns are read in
Returns
a DataTableStr containing the column names and data
Examples
linreg.cpp.

Definition at line 57 of file datatable.cpp.

{

    int fd = open(filename.c_str(), O_RDONLY);
    if (fd == -1) {
        // Handle file open error
        perror("Error opening file");
        exit(EXIT_FAILURE);
    }

    off_t size = lseek(fd, 0, SEEK_END);
    char *file_data =
        static_cast<char *>(mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0));

    if (file_data == MAP_FAILED) {
        // Handle memory mapping error
        perror("Error mapping file to memory");
        close(fd);
        exit(EXIT_FAILURE);
    }

    // The mapping is not guaranteed to be NUL-terminated, so copy exactly
    // `size` bytes instead of treating file_data as a C string
    std::stringstream file_stream(
        std::string(file_data, static_cast<size_t>(size)));
    std::vector<std::vector<std::string>> data;
    std::string line;

    // Get the header line and parse the column names
    getline(file_stream, line);
    std::stringstream header(line);
    std::vector<std::string> header_cols;
    std::string columnName;

    while (getline(header, columnName, ',')) {
        header_cols.push_back(columnName);
    }

    // If no columns are specified, read in all columns
    if (columns.empty()) {
        columns = header_cols;
    }

    // Check if specified columns exist in the header
    for (const auto &column : columns) {
        if (std::find(header_cols.begin(), header_cols.end(), column) ==
            header_cols.end()) {
            // Handle column not found error
            perror(("Column: " + column + " not found").c_str());
            munmap(file_data, size);
            close(fd);
            exit(EXIT_FAILURE);
        }
    }

    // Read in the data rows
    while (getline(file_stream, line)) {
        std::vector<std::string> row;
        std::stringstream rowStream(line);
        std::string value;
        int columnIndex = 0;

        while (getline(rowStream, value, ',')) {
            // If column is specified, only read in specified columns.
            // Fields beyond the header width are ignored
            if (columnIndex < static_cast<int>(header_cols.size()) &&
                std::find(columns.begin(),
                          columns.end(),
                          header_cols[columnIndex]) != columns.end()) {
                row.push_back(value);
            }

            columnIndex++;
        }

        if (!row.empty()) {
            data.push_back(row);
        }
    }

    // populate headers_ class variable
    headers_ = columns;
    // populate data_ class variable
    data_ = data;

    munmap(file_data, size);
    close(fd);

    return std::make_pair(headers_, data_);
}

References data_, and headers_.

Referenced by main(), and test_train().
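
The column-selection step of the listing above can be sketched in isolation. This is a minimal, self-contained illustration and not the library's implementation: `StrTable`, `split_line` and `parse_csv` are hypothetical names, the input is an in-memory string rather than a memory-mapped file, and quoted fields are not handled. Unlike the listing, the sketch returns the selected headers in file order so that they always line up with the row values.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias mirroring the header/rows pair used on this page.
using StrTable = std::pair<std::vector<std::string>,
                           std::vector<std::vector<std::string>>>;

// Split one line on a single-character delimiter (no quoting support).
std::vector<std::string> split_line(const std::string &line, char delim) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}

// Parse CSV text, keeping only the requested columns (all if empty).
StrTable parse_csv(const std::string &text,
                   std::vector<std::string> columns = {}) {
    std::stringstream stream(text);
    std::string line;
    std::getline(stream, line);
    const std::vector<std::string> header = split_line(line, ',');
    if (columns.empty()) {
        columns = header;
    }

    // Decide once, per header position, whether the column is kept.
    StrTable table;
    std::vector<bool> keep(header.size(), false);
    for (std::size_t i = 0; i < header.size(); ++i) {
        keep[i] = std::find(columns.begin(), columns.end(), header[i]) !=
                  columns.end();
        if (keep[i]) {
            table.first.push_back(header[i]);
        }
    }

    while (std::getline(stream, line)) {
        const std::vector<std::string> fields = split_line(line, ',');
        std::vector<std::string> row;
        // Fields beyond the header width are ignored.
        for (std::size_t i = 0; i < fields.size() && i < header.size(); ++i) {
            if (keep[i]) {
                row.push_back(fields[i]);
            }
        }
        if (!row.empty()) {
            table.second.push_back(row);
        }
    }
    return table;
}
```

Calling `parse_csv("a,b,c\n1,2,3\n", {"c", "a"})` in this sketch yields the headers `a, c` with the row `1, 3`. Precomputing the `keep` mask avoids a `std::find` per cell.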

◆ csv_read_new()

TableType gpmp::core::DataTable::csv_read_new ( std::string  filename,
std::vector< std::string >  columns = {} 
)

Reads a CSV file and returns a DataTableStr. All data is parsed and stored as strings.

Parameters
filename: the path to the CSV file
columns: optional vector of column names to read in; if empty, all columns are read in
Returns
a DataTableStr containing the column names and data

◆ csv_write() [1/2]

void gpmp::core::DataTable::csv_write ( )

Write DataTable to a CSV file.

◆ csv_write() [2/2]

void gpmp::core::DataTable::csv_write ( )

Write DataTable to a CSV file.

◆ datetime() [1/2]

gpmp::core::DataTableStr gpmp::core::DataTable::datetime ( std::string  column_name,
bool  extract_year = true,
bool  extract_month = true,
bool  extract_time = false 
)

Extracts date and time components from a timestamp column.

Parameters
column_name: The name of the timestamp column
extract_year: If true, extract the year component
extract_month: If true, extract the month component
extract_time: If true, extract the time component
Returns
A new DataTableStr with extracted components

Definition at line 147 of file datatable.cpp.

{
    // Find the index of the specified column
    auto column_iter = std::find(headers_.begin(), headers_.end(), column_name);
    if (column_iter == headers_.end()) {
        _log_.log(ERROR, "Column: " + column_name + " not found");
        exit(EXIT_FAILURE);
    }
    int column_index = std::distance(headers_.begin(), column_iter);

    // Extract components from each row
    std::vector<std::string> new_headers = headers_;
    std::vector<std::vector<std::string>> new_data;

    // Iterate and populate the additional columns
    for (size_t row_index = 0; row_index < data_.size(); ++row_index) {
        std::vector<std::string> row = data_[row_index];
        // If the row is too short to contain the column
        if (row.size() <= static_cast<size_t>(column_index)) {
            _log_.log(ERROR, "Column: " + column_name + " not found");

            exit(EXIT_FAILURE);
        }

        std::string timestamp = row[column_index];
        std::string year, month, time;

        // Create a new row with extracted components
        std::vector<std::string> new_row;

        // Extract year, month, and time components
        if (extract_year) {
            year = timestamp.substr(timestamp.find_last_of('/') + 1, 4);
            new_row.push_back(year);
        }
        if (extract_month) {
            month = timestamp.substr(0, timestamp.find_first_of('/'));
            new_row.push_back(month);
        }
        if (extract_time) {
            time = timestamp.substr(timestamp.find(' ') + 1);
            new_row.push_back(time);
        }

        // append original row data
        new_row.insert(new_row.end(), row.begin(), row.end());
        // add new rows
        new_data.push_back(new_row);
    }

    // Create new headers based on the extracted components. Each insert
    // goes to the front, so insert in reverse of the row order
    // (Year, Month, Time) to keep the headers aligned with the data
    if (extract_time)
        new_headers.insert(new_headers.begin(), "Time");
    if (extract_month)
        new_headers.insert(new_headers.begin(), "Month");
    if (extract_year)
        new_headers.insert(new_headers.begin(), "Year");

    // set class var data_ to hold rows/lines
    data_ = new_data;
    // set class var modified headers to new headers
    // new_headers_ = new_headers;
    headers_ = new_headers;

    return std::make_pair(new_headers, new_data);
}

References _log_, ERROR, gpmp::core::Logger::log(), and year.
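
The substring arithmetic in the listing above implies timestamps shaped like `MM/DD/YYYY HH:MM`. That format is an inference from the code, not something this page states. The following sketch isolates the extraction step under that assumption. `DateParts` and `split_timestamp` are hypothetical names. Unlike the listing, the sketch leaves a component empty when its delimiter is missing.

```cpp
#include <string>

// Hypothetical container for the three extracted components.
struct DateParts {
    std::string year;
    std::string month;
    std::string time;
};

// Split a timestamp assumed to look like "MM/DD/YYYY HH:MM".
DateParts split_timestamp(const std::string &ts) {
    DateParts parts;
    const std::string::size_type first_slash = ts.find_first_of('/');
    const std::string::size_type last_slash = ts.find_last_of('/');
    const std::string::size_type space = ts.find(' ');

    // Two distinct slashes are required for a month and a year.
    if (first_slash == std::string::npos || first_slash == last_slash) {
        return parts;
    }
    parts.month = ts.substr(0, first_slash);
    parts.year = ts.substr(last_slash + 1, 4);
    if (space != std::string::npos) {
        parts.time = ts.substr(space + 1);
    }
    return parts;
}
```

For the input `10/25/2021 13:45` the sketch produces year `2021`, month `10` and time `13:45`.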

◆ datetime() [2/2]

DataTableStr gpmp::core::DataTable::datetime ( std::string  column_name,
bool  extract_year = true,
bool  extract_month = true,
bool  extract_time = false 
)

Extracts date and time components from a timestamp column.

Parameters
column_name: The name of the timestamp column
extract_year: If true, extract the year component
extract_month: If true, extract the month component
extract_time: If true, extract the time component
Returns
A new DataTableStr with extracted components

◆ describe() [1/2]

void gpmp::core::DataTable::describe ( )

Prints some information about the DataTable.

◆ describe() [2/2]

void gpmp::core::DataTable::describe ( )

Displays some information about the DataTable.

◆ display() [1/4]

void gpmp::core::DataTable::display ( bool  display_all = false)
inline

Overload function for display() defaults to displaying what is currently stored in a DataTable object.

Parameters
display_all: Display all rows, defaults to false.

Definition at line 316 of file datatable.hpp.

{
    display(std::make_pair(headers_, data_), display_all);
}

References data_, display(), and headers_.

◆ display() [2/4]

void gpmp::core::DataTable::display ( bool  display_all = false)

Overload function for display() defaults to displaying what is currently stored in a DataTable object.

Parameters
display_all: Display all rows, defaults to false.

◆ display() [3/4]

void gpmp::core::DataTable::display ( const TableType data,
bool  display_all = false 
)

Displays a DataTable of type T with the option to display all rows or only a subset.

Template Parameters
T: The type of the DataTable to be displayed
Parameters
data: A pair of vectors representing the header and data rows of the DataTable
display_all: A flag indicating whether to display all rows or just a subset

Definition at line 196 of file datatable2.cpp.

{
    int num_columns = data.first.size();
    int num_rows = data.second.size();
    int num_omitted_rows = 0;

    std::vector<int> max_column_widths(num_columns, 0);

    // Calculate the maximum width for each column based on column headers
    for (int i = 0; i < num_columns; i++) {
        max_column_widths[i] = data.first[i].length();
    }

    // Calculate the maximum width for each column based on data rows
    for (int i = 0; i < num_columns; i++) {
        for (const auto &row : data.second) {
            if (i < static_cast<int>(row.size())) {
                std::visit(
                    [&max_column_widths, &i](const auto &cellValue) {
                        using T = std::decay_t<decltype(cellValue)>;
                        if constexpr (std::is_same_v<T, std::string>) {
                            max_column_widths[i] =
                                std::max(max_column_widths[i],
                                         static_cast<int>(cellValue.length()));
                        } else if constexpr (std::is_integral_v<T> ||
                                             std::is_floating_point_v<T>) {
                            max_column_widths[i] = std::max(
                                max_column_widths[i],
                                static_cast<int>(
                                    std::to_string(cellValue).length()));
                        }
                    },
                    row[i]);
            }
        }
    }

    const int dateTimeColumnIndex = 0;
    max_column_widths[dateTimeColumnIndex] =
        std::max(max_column_widths[dateTimeColumnIndex], 0);

    // Define a function to print a row
    auto printRow = [&data, &max_column_widths, num_columns](int row_index) {
        std::cout << std::setw(7) << std::right << row_index << " ";

        for (int j = 0; j < num_columns; j++) {
            if (j < static_cast<int>(data.second[row_index].size())) {
                std::visit(
                    [&max_column_widths, &j](const auto &cellValue) {
                        using T = std::decay_t<decltype(cellValue)>;
                        if constexpr (std::is_same_v<T, double> ||
                                      std::is_same_v<T, long double>) {
                            // Convert the value to a string without trailing
                            // zeros
                            std::string cellValueStr =
                                std::to_string(cellValue);
                            cellValueStr.erase(
                                cellValueStr.find_last_not_of('0') + 1,
                                std::string::npos);
                            cellValueStr.erase(
                                cellValueStr.find_last_not_of('.') + 1,
                                std::string::npos);

                            std::cout << std::setw(max_column_widths[j])
                                      << std::right << cellValueStr << " ";
                        } else {
                            std::cout << std::setw(max_column_widths[j])
                                      << std::right << cellValue << " ";
                        }
                    },
                    data.second[row_index][j]);
            }
        }

        std::cout << std::endl;
    };

    // Print headers
    std::cout << std::setw(7) << std::right << "Index"
              << " ";
    for (int i = 0; i < num_columns; i++) {
        std::cout << std::setw(max_column_widths[i]) << std::right
                  << data.first[i] << " ";
    }
    std::cout << std::endl;

    int num_elements = data.second.size();
    if (!display_all && num_elements > MAX_ROWS) {
        for (int i = 0; i < SHOW_ROWS; i++) {
            printRow(i);
        }
        num_omitted_rows = num_elements - MAX_ROWS;
        std::cout << "...\n";
        std::cout << "[" << num_omitted_rows << " rows omitted]\n";
        for (int i = num_elements - SHOW_ROWS; i < num_elements; i++) {
            printRow(i);
        }
    } else {
        // Print all rows
        for (int i = 0; i < num_elements; i++) {
            printRow(i);
        }
    }

    // Print the number of rows and columns
    std::cout << "[" << num_rows << " rows"
              << " x " << num_columns << " columns";
    std::cout << "]\n\n";
}

References MAX_ROWS, and SHOW_ROWS.

◆ display() [4/4]

template<typename T >
void gpmp::core::DataTable::display ( std::pair< std::vector< T >, std::vector< std::vector< T >>>  data,
bool  display_all = false 
)
inline

Displays a DataTable of type T with the option to display all rows or only a subset.

Template Parameters
T: The type of the DataTable to be displayed
Parameters
data: A pair of vectors representing the header and data rows of the DataTable
display_all: A flag indicating whether to display all rows or just a subset

Definition at line 216 of file datatable.hpp.

{
    // Get the number of columns and rows in the data
    int num_columns = data.first.size();
    int num_rows = data.second.size();
    int num_omitted_rows = 0;

    // Initialize max_column_widths with the lengths of column headers
    std::vector<int> max_column_widths(num_columns, 0);

    // Calculate the maximum width for each column based on column headers
    for (int i = 0; i < num_columns; i++) {
        max_column_widths[i] = data.first[i].length();
    }

    // Calculate the maximum width for each column based on data rows
    for (int i = 0; i < num_columns; i++) {
        for (const auto &row : data.second) {
            if (i < static_cast<int>(row.size())) {
                max_column_widths[i] =
                    std::max(max_column_widths[i],
                             static_cast<int>(row[i].length()));
            }
        }
    }

    // Set a larger width for the DateTime column (adjust the index as
    // needed later on)
    const int dateTimeColumnIndex = 0;
    // adjust as needed?
    max_column_widths[dateTimeColumnIndex] =
        std::max(max_column_widths[dateTimeColumnIndex], 0);

    // Print headers with right-aligned values
    std::cout << std::setw(7) << std::right << "Index"
              << " ";

    for (int i = 0; i < num_columns; i++) {
        std::cout << std::setw(max_column_widths[i]) << std::right
                  << data.first[i] << " ";
    }
    std::cout << std::endl;

    int num_elements = data.second.size();
    if (!display_all && num_elements > MAX_ROWS) {
        for (int i = 0; i < SHOW_ROWS; i++) {
            // Print index
            std::cout << std::setw(7) << std::right << i << " ";
            // Print each row with right-aligned values
            for (int j = 0; j < num_columns; j++) {
                if (j < static_cast<int>(data.second[i].size())) {
                    std::cout << std::setw(max_column_widths[j])
                              << std::right << data.second[i][j] << " ";
                }
            }
            std::cout << std::endl;
        }
        num_omitted_rows = num_elements - MAX_ROWS;
        std::cout << "...\n";
        std::cout << "[" << num_omitted_rows << " rows omitted]\n";
        for (int i = num_elements - SHOW_ROWS; i < num_elements; i++) {
            std::cout << std::setw(7) << std::right << i << " ";
            // Print each row with right-aligned values
            for (int j = 0; j < num_columns; j++) {
                if (j < static_cast<int>(data.second[i].size())) {
                    std::cout << std::setw(max_column_widths[j])
                              << std::right << data.second[i][j] << " ";
                }
            }
            std::cout << std::endl;
        }
    } else {
        // Print all rows with right-aligned values
        for (int i = 0; i < num_elements; i++) {

            // Print index
            std::cout << std::setw(7) << std::right << i << " ";
            for (int j = 0; j < num_columns; j++) {
                if (j < static_cast<int>(data.second[i].size())) {

                    // Print formatted row
                    std::cout << std::setw(max_column_widths[j])
                              << std::right << data.second[i][j] << " ";
                }
            }
            std::cout << std::endl;
        }
    }

    // Print the number of rows and columns
    std::cout << "[" << num_rows << " rows"
              << " x " << num_columns << " columns";
    std::cout << "]\n\n";
}

References MAX_ROWS, and SHOW_ROWS.

Referenced by display(), and main().
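
Both display listings combine two separable steps: computing a width per column, and choosing which rows to print when the table is long. The sketch below separates them for illustration only. `column_widths` and `visible_rows` are hypothetical helpers, and because the values of MAX_ROWS and SHOW_ROWS are not shown on this page they are passed in as the parameters `max_rows` and `show_rows`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Widest entry per column, starting from the header widths.
std::vector<std::size_t>
column_widths(const std::vector<std::string> &headers,
              const std::vector<std::vector<std::string>> &rows) {
    std::vector<std::size_t> widths;
    for (const std::string &header : headers) {
        widths.push_back(header.size());
    }
    for (const auto &row : rows) {
        // Short rows only affect the columns they actually contain.
        const std::size_t n = std::min(row.size(), widths.size());
        for (std::size_t i = 0; i < n; ++i) {
            widths[i] = std::max(widths[i], row[i].size());
        }
    }
    return widths;
}

// Row indices to print: every row, or the first and last `show_rows`.
std::vector<std::size_t> visible_rows(std::size_t num_rows,
                                      std::size_t max_rows,
                                      std::size_t show_rows,
                                      bool display_all) {
    // Assumed precondition so the head and tail slices never overlap.
    assert(2 * show_rows <= max_rows);
    std::vector<std::size_t> indices;
    if (!display_all && num_rows > max_rows) {
        for (std::size_t i = 0; i < show_rows; ++i) {
            indices.push_back(i);
        }
        for (std::size_t i = num_rows - show_rows; i < num_rows; ++i) {
            indices.push_back(i);
        }
    } else {
        for (std::size_t i = 0; i < num_rows; ++i) {
            indices.push_back(i);
        }
    }
    return indices;
}
```

With this split, the number of omitted rows is `num_rows - indices.size()`. That equals the listings' `num_elements - MAX_ROWS` only when `MAX_ROWS` is exactly twice `SHOW_ROWS`.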

◆ drop()

void gpmp::core::DataTable::drop ( std::vector< std::string >  column_name)

Drop specified rows from a DataTable.

◆ first() [1/2]

gpmp::core::DataTableStr gpmp::core::DataTable::first ( const std::vector< gpmp::core::DataTableStr > &  groups) const

Gets the first element of each created group.

Parameters
groups: The return value of gpmp::core::DataTable::group_by()
Returns
a DataTableStr

Definition at line 317 of file datatable.cpp.

{
    if (groups.empty()) {
        // Handle the case when there are no groups
        return std::make_pair(std::vector<std::string>(),
                              std::vector<std::vector<std::string>>());
    }

    std::vector<std::vector<std::string>> first_rows;

    for (const gpmp::core::DataTableStr &group : groups) {
        if (!group.second.empty()) {
            first_rows.push_back(
                group.second[0]); // Get the first row of each group
        }
    }

    if (!first_rows.empty()) {
        // Assuming all groups have the same headers as the first group
        return std::make_pair(groups[0].first, first_rows);
    } else {
        // Handle the case when there are no first rows found.
        return std::make_pair(groups[0].first,
                              std::vector<std::vector<std::string>>());
    }
}
std::pair< std::vector< std::string >, std::vector< std::vector< std::string > > > DataTableStr
Definition: datatable.hpp:65
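
The behaviour of first() can be summarised as "take row 0 of every non-empty group and reuse the first group's headers". A compact sketch of that idea follows, written against a local alias so that it compiles without the library. `StrTable` and `first_of_each` are hypothetical names.

```cpp
#include <string>
#include <utility>
#include <vector>

// Local alias with the same shape as the DataTableStr typedef above.
using StrTable = std::pair<std::vector<std::string>,
                           std::vector<std::vector<std::string>>>;

// Collect the first row of every non-empty group into one table.
StrTable first_of_each(const std::vector<StrTable> &groups) {
    StrTable result;
    if (groups.empty()) {
        return result; // no groups: empty headers and empty rows
    }
    // As in the listing, all groups are assumed to share the same headers.
    result.first = groups.front().first;
    for (const StrTable &group : groups) {
        if (!group.second.empty()) {
            result.second.push_back(group.second.front());
        }
    }
    return result;
}
```

Because the result starts out empty, the two return branches of the listing collapse into a single return in this sketch.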

◆ first() [2/2]

DataTableStr gpmp::core::DataTable::first ( const std::vector< gpmp::core::DataTableStr > &  groups) const

Gets the first element of each created group.

Parameters
groups: The return value of gpmp::core::DataTable::group_by()
Returns
a DataTableStr

◆ group_by() [1/2]

std::vector< gpmp::core::DataTableStr > gpmp::core::DataTable::group_by ( std::vector< std::string >  group_by_columns)

Groups the data by specified columns.

Parameters
group_by_columns: The column names to group by
Returns
A vector of DataTableStr, each containing a group of rows

Definition at line 250 of file datatable.cpp.

{
    // Find the indices of the specified group by columns
    std::vector<int> group_by_indices;

    // Traverse group column names
    for (const std::string &column_name : group_by_columns) {
        std::cout << "Searching for column: " << column_name << std::endl;

        // Find start/end and match column name
        auto column_iter =
            std::find(headers_.begin(), headers_.end(), column_name);

        // If the column does not exist
        if (column_iter == headers_.end()) {
            _log_.log(ERROR, "Column: " + column_name + " not found");
            exit(EXIT_FAILURE);
        }
        // column index is the distance from the first header to the match
        int column_index = std::distance(headers_.begin(), column_iter);
        // add column index to group
        group_by_indices.push_back(column_index);
    }

    // Group the data based on the specified columns using a vector
    std::vector<std::pair<std::vector<std::string>, gpmp::core::DataTableStr>>
        groups;

    // Traverse row/line data
    for (const std::vector<std::string> &row : data_) {
        // store group key for each row
        std::vector<std::string> group_key;
        // Fill group key from specified group column names
        for (int index : group_by_indices) {
            group_key.push_back(row[index]);
        }

        // Check if the group already exists
        auto group_iter = std::find_if(
            groups.begin(),
            groups.end(),
            [&group_key](const std::pair<std::vector<std::string>,
                                         gpmp::core::DataTableStr> &group) {
                return group.first == group_key;
            });
        // If the group does not exist, create a new one in the groups vector
        if (group_iter == groups.end()) {
            // Create a new group
            groups.push_back(
                {group_key, gpmp::core::DataTableStr(headers_, {})});
            group_iter = groups.end() - 1;
        }
        // Add current row to group
        group_iter->second.second.push_back(row);
    }

    // Extract the grouped data into a vector
    std::vector<gpmp::core::DataTableStr> grouped_data;
    // Iterate over the groups (in first-seen order) to build the result
    for (const auto &group : groups) {
        grouped_data.push_back(group.second);
    }

    // Return the vector of grouped DataTableStr tables
    return grouped_data;
}

References _log_, ERROR, and gpmp::core::Logger::log().
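
The listing above finds each row's group with a linear `std::find_if`, so grouping costs grow with rows times groups. The sketch below swaps in a different technique, a `std::map` from key to group index, while keeping the same first-seen group order. It is an illustrative alternative and not the library's code. `StrTable` and `group_rows` are hypothetical names, and the key columns are given as indices rather than names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Local alias with the same shape as the DataTableStr typedef.
using StrTable = std::pair<std::vector<std::string>,
                           std::vector<std::vector<std::string>>>;

// Group rows by the values found in `key_cols`, in first-seen order.
std::vector<StrTable> group_rows(const StrTable &table,
                                 const std::vector<std::size_t> &key_cols) {
    std::vector<StrTable> groups;
    // Maps a group key to the position of its group in `groups`.
    std::map<std::vector<std::string>, std::size_t> index_of;

    for (const std::vector<std::string> &row : table.second) {
        std::vector<std::string> key;
        for (std::size_t col : key_cols) {
            assert(col < row.size()); // every row must contain the key columns
            key.push_back(row[col]);
        }

        auto it = index_of.find(key);
        if (it == index_of.end()) {
            // New key: append a group that carries the table's headers.
            it = index_of.emplace(key, groups.size()).first;
            groups.emplace_back();
            groups.back().first = table.first;
        }
        groups[it->second].second.push_back(row);
    }
    return groups;
}
```

The map costs extra memory for the keys. For small tables the linear search in the listing may be just as fast.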

◆ group_by() [2/2]

std::vector<DataTableStr> gpmp::core::DataTable::group_by ( std::vector< std::string >  group_by_columns)

Groups the data by specified columns.

Parameters
group_by_columns: The column names to group by
Returns
A vector of DataTableStr, each containing a group of rows

◆ inferType()

gpmp::core::DataType gpmp::core::DataTable::inferType ( const std::vector< std::string > &  column)

Definition at line 135 of file datatable1.cpp.

{
    int integer_count = 0;
    int double_count = 0;
    int string_count = 0;

    for (const std::string &cell : column) {
        if (is_int(cell)) {
            integer_count++;
        } else if (is_double(cell)) {
            double_count++;
        } else {
            string_count++;
        }
    }

    _log_.log(INFO,
              "int/double/str: " + std::to_string(integer_count) + "/" +
                  std::to_string(double_count) + "/" +
                  std::to_string(string_count));

    if (integer_count > double_count) {
        return DataType::dt_int32;
    } else if (double_count > integer_count) {
        return DataType::dt_double;
    } else {
        return DataType::dt_str;
    }
}

References _log_, gpmp::core::dt_double, gpmp::core::dt_int32, gpmp::core::dt_str, INFO, is_double(), is_int(), and gpmp::core::Logger::log().
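
Note that the listing compares only the integer and double counts. `string_count` is logged but never used in the decision, so a column containing one integer and many non-numeric cells is reported as dt_int32. The sketch below shows a stricter variant that picks a numeric type only when every cell parses. It is an alternative for illustration, not the library's rule. `ColType`, `parses_as_int`, `parses_as_double` and `infer_strict` are hypothetical names, and the bodies of the library's is_int() and is_double() are not shown on this page.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's DataType enum.
enum class ColType { Int, Double, Str };

// True if the whole string is a base-10 integer that fits in long long.
bool parses_as_int(const std::string &s) {
    if (s.empty()) {
        return false;
    }
    char *end = nullptr;
    errno = 0;
    std::strtoll(s.c_str(), &end, 10);
    return errno == 0 && end == s.c_str() + s.size();
}

// True if the whole string is accepted by strtold (this includes "inf"
// and "nan", which a real implementation may want to reject).
bool parses_as_double(const std::string &s) {
    if (s.empty()) {
        return false;
    }
    char *end = nullptr;
    errno = 0;
    std::strtold(s.c_str(), &end);
    return errno == 0 && end == s.c_str() + s.size();
}

// Int if every cell is an integer, Double if every cell is numeric,
// otherwise Str. An empty column is treated as Str.
ColType infer_strict(const std::vector<std::string> &column) {
    if (column.empty()) {
        return ColType::Str;
    }
    bool all_int = true;
    for (const std::string &cell : column) {
        if (!parses_as_double(cell)) {
            return ColType::Str; // one non-numeric cell decides the column
        }
        if (!parses_as_int(cell)) {
            all_int = false;
        }
    }
    return all_int ? ColType::Int : ColType::Double;
}
```

A strict rule of this kind keeps a later conversion step from calling a numeric parser on a cell that cannot be parsed.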

◆ info()

void gpmp::core::DataTable::info ( )

Displays data types and null values for each column.

Definition at line 647 of file datatable2.cpp.

{
    // Calculate memory usage for each column and keep track of data type
    std::vector<double> column_memory_usages(headers_.size(), 0.0);
    std::vector<std::string> column_data_types(headers_.size());
    double total_memory_usage_kb = 0.0;

    // Calculate memory usage in bytes for the entire table
    size_t memory_usage_bytes = sizeof(headers_);
    for (const auto &row : data_) {
        for (size_t i = 0; i < row.size(); ++i) {
            if (std::holds_alternative<int64_t>(row[i])) {
                memory_usage_bytes += sizeof(int64_t);
                column_memory_usages[i] +=
                    static_cast<double>(sizeof(int64_t)) / 1024.0;
                column_data_types[i] = "int64_t";
            } else if (std::holds_alternative<long double>(row[i])) {
                memory_usage_bytes += sizeof(long double);
                column_memory_usages[i] +=
                    static_cast<double>(sizeof(long double)) / 1024.0;
                column_data_types[i] = "long double";
            } else if (std::holds_alternative<std::string>(row[i])) {
                memory_usage_bytes += std::get<std::string>(row[i]).capacity();
                column_memory_usages[i] +=
                    static_cast<double>(
                        std::get<std::string>(row[i]).capacity()) /
                    1024.0;
                column_data_types[i] = "std::string";
            }
        }
    }

    // Convert total memory usage to KB
    total_memory_usage_kb = static_cast<double>(memory_usage_bytes) / 1024.0;

    // Find the maximum column name length
    size_t max_column_name_length = 0;
    for (const std::string &column : headers_) {
        max_column_name_length =
            std::max(max_column_name_length, column.length());
    }

    // Find the maximum data type length
    size_t max_data_type_length = 0;
    for (const std::string &data_type : column_data_types) {
        max_data_type_length =
            std::max(max_data_type_length, data_type.length());
    }

    // Set the column width for formatting
    int column_width = static_cast<int>(std::max(max_column_name_length,
                                                 max_data_type_length)) +
                       2; // Add extra padding

    // Print header
    std::cout << std::left << std::setw(column_width) << "Column"
              << std::setw(column_width) << "Type" << std::setw(column_width)
              << "Memory Usage (KB)" << std::endl;

    // Print data
    for (size_t i = 0; i < headers_.size(); ++i) {
        std::cout << std::left << std::setw(column_width) << headers_[i]
                  << std::setw(column_width) << column_data_types[i]
                  << std::setw(column_width) << std::fixed
                  << std::setprecision(2) << column_memory_usages[i]
                  << std::endl;
    }

    // Print total table memory usage
    std::cout << "\nTotal Memory Usage: " << std::fixed << std::setprecision(2)
              << total_memory_usage_kb << " KB" << std::endl;
}
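
The per-cell accounting in the listing above can be expressed with a single `std::visit` instead of a chain of `std::holds_alternative` checks. The sketch below uses the same variant alternatives as MixedType and counts the same quantities: the size of the numeric alternative, or the string's capacity. `Cell`, `cell_payload_bytes` and `column_bytes` are hypothetical names. Like the listing, this is an estimate that ignores the variant's own storage and any allocator overhead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Same alternatives as the MixedType cells on this page.
using Cell = std::variant<std::int64_t, long double, std::string>;

// Approximate payload size of one cell in bytes.
std::size_t cell_payload_bytes(const Cell &cell) {
    return std::visit(
        [](const auto &value) -> std::size_t {
            using T = std::decay_t<decltype(value)>;
            if constexpr (std::is_same_v<T, std::string>) {
                return value.capacity(); // characters reserved by the string
            } else {
                return sizeof(T); // fixed-size numeric alternative
            }
        },
        cell);
}

// Sum the payload bytes per column; cells past `num_columns` are ignored.
std::vector<std::size_t>
column_bytes(const std::vector<std::vector<Cell>> &rows,
             std::size_t num_columns) {
    std::vector<std::size_t> totals(num_columns, 0);
    for (const auto &row : rows) {
        for (std::size_t i = 0; i < row.size() && i < num_columns; ++i) {
            totals[i] += cell_payload_bytes(row[i]);
        }
    }
    return totals;
}
```

Dividing each total by 1024.0 gives the kilobyte figures that info() prints.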

◆ json_read() [1/2]

DataTableStr gpmp::core::DataTable::json_read ( std::string  filename,
std::vector< std::string >  objs = {} 
)

Reads a JSON file and returns a DataTableStr. All data is parsed and stored as strings.

Parameters
filename: the path to the JSON file
objs: optional vector of JSON object names to read in; if empty, all objects are read in
Returns
a DataTableStr containing the column names and data

◆ json_read() [2/2]

DataTableStr gpmp::core::DataTable::json_read ( std::string  filename,
std::vector< std::string >  objs = {} 
)

Reads a JSON file and returns a DataTableStr. All data is parsed and stored as strings.

Parameters
filename: the path to the JSON file
objs: optional vector of JSON object names to read in; if empty, all objects are read in
Returns
a DataTableStr containing the column names and data

◆ native_type()

gpmp::core::TableType gpmp::core::DataTable::native_type ( const std::vector< std::string > &  skip_columns = {})

Converts each DataTable column to its native type. The existing read/load methods produce only DataTableStr (all-string) tables, so this method converts those string values to their native types.
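
As a rough illustration of the conversion step, the sketch below turns one row of strings into variant cells given a type per column. It is written under stated assumptions and is not the library's implementation: `Cell`, `ColType` and `convert_row` are hypothetical names, and a cell that does not parse completely is kept as a string instead of raising an exception.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <variant>
#include <vector>

// Same alternatives as the MixedType cells on this page.
using Cell = std::variant<std::int64_t, long double, std::string>;

// Hypothetical stand-in for the library's DataType enum.
enum class ColType { Int, Double, Str };

// Convert one row of strings using the given per-column types.
std::vector<Cell> convert_row(const std::vector<std::string> &row,
                              const std::vector<ColType> &types) {
    std::vector<Cell> out;
    for (std::size_t i = 0; i < row.size(); ++i) {
        // Columns without a declared type stay as strings.
        const ColType type = i < types.size() ? types[i] : ColType::Str;
        const std::string &cell = row[i];
        try {
            std::size_t pos = 0;
            if (type == ColType::Int) {
                const long long value = std::stoll(cell, &pos);
                if (pos == cell.size()) {
                    // Cast so the variant alternative is unambiguous.
                    out.emplace_back(static_cast<std::int64_t>(value));
                    continue;
                }
            } else if (type == ColType::Double) {
                const long double value = std::stold(cell, &pos);
                if (pos == cell.size()) {
                    out.emplace_back(value);
                    continue;
                }
            }
        } catch (const std::exception &) {
            // Not parseable: fall through and keep the original text.
        }
        out.emplace_back(cell);
    }
    return out;
}
```

Checking `pos` against the cell length rejects partial parses such as `12abc`, which `std::stoll` would otherwise accept as 12.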

Definition at line 176 of file datatable1.cpp.

{
    gpmp::core::TableType mixed_data;

    // Include all column headers in mixed_data (including skipped ones)
    mixed_data.first = headers_;

    std::vector<gpmp::core::DataType> column_data_types;

    // Determine data types for each column (skip_columns remain as strings)
    for (size_t col = 0; col < headers_.size(); ++col) {
        // Check if this column should be skipped
        if (std::find(skip_columns.begin(),
                      skip_columns.end(),
                      headers_[col]) != skip_columns.end()) {
            column_data_types.push_back(gpmp::core::DataType::dt_str);
            _log_.log(INFO, "Skipping column: " + headers_[col]);
        } else {
            std::vector<std::string> column_data;
            for (const std::vector<std::string> &rowData : data_) {
                column_data.push_back(rowData[col]);
            }
            gpmp::core::DataType column_type = inferType(column_data);
            column_data_types.push_back(column_type);

            _log_.log(INFO,
                      "Column " + headers_[col] +
                          " using type: " + dt_to_str(column_type));
        }
    }

    // Traverse rows and convert based on the determined data types
    for (const std::vector<std::string> &row : data_) {
        std::vector<std::variant<int64_t, long double, std::string>> mixed_row;

        for (size_t col = 0; col < headers_.size(); ++col) {
            const std::string &cell = row[col];
            gpmp::core::DataType column_type = column_data_types[col];

            if (column_type == gpmp::core::DataType::dt_int32) {

                mixed_row.push_back(std::stoi(cell));
            } else if (column_type == gpmp::core::DataType::dt_double) {
                mixed_row.push_back(std::stold(cell));
            } else {
                mixed_row.push_back(cell); // Keep as a string
            }
        }

        mixed_data.second.push_back(mixed_row);
    }

    std::cout << "Mixed Data:" << std::endl;
    for (const std::string &header : mixed_data.first) {
        std::cout << header << " ";
    }
    std::cout << std::endl;

    for (const auto &row : mixed_data.second) {
        for (const auto &cell : row) {
            if (std::holds_alternative<int64_t>(cell)) {
                std::cout << std::get<int64_t>(cell) << " ";
238  } else if (std::holds_alternative<long double>(cell)) {
239  std::cout << std::get<long double>(cell) << " ";
240  } else if (std::holds_alternative<std::string>(cell)) {
241  std::cout << std::get<std::string>(cell) << " ";
242  }
243  }
244  std::cout << std::endl;
245  }
246 
247  return mixed_data;
248 }
DataType inferType(const std::vector< std::string > &column)
Definition: datatable1.cpp:135
std::string dt_to_str(gpmp::core::DataType type)
Definition: datatable1.cpp:163
std::pair< std::vector< std::string >, std::vector< std::vector< std::variant< int64_t, long double, std::string > > > > TableType
DataType: enum for representing different data types
Definition: datatable.hpp:59

References _log_, gpmp::core::dt_double, gpmp::core::dt_int32, gpmp::core::dt_str, dt_to_str(), INFO, and gpmp::core::Logger::log().
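
The two-pass approach in the listing above (infer one type per column, then convert every cell) can be sketched in isolation. This is a minimal, self-contained sketch and not the library's implementation: `infer_kind`, `to_native`, `Kind`, `StrTable` and `MixedTable` are hypothetical names, and the inference rule (every cell parses fully as an integer, else every cell parses fully as a floating-point value, else string) is an assumption, since the body of `inferType()` is not shown here. It also uses `std::stoll` rather than `std::stoi`, so that the full `int64_t` range is accepted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical aliases mirroring the shapes used on this page.
using Cell = std::variant<int64_t, long double, std::string>;
using StrTable = std::pair<std::vector<std::string>, std::vector<std::vector<std::string>>>;
using MixedTable = std::pair<std::vector<std::string>, std::vector<std::vector<Cell>>>;

enum class Kind { Int, Float, Str };

// True if the whole string is consumed when parsed as an integer.
bool parses_as_int(const std::string &s) {
    try {
        std::size_t pos = 0;
        (void)std::stoll(s, &pos);
        return pos == s.size();
    } catch (const std::exception &) {
        return false;
    }
}

// True if the whole string is consumed when parsed as a long double.
bool parses_as_float(const std::string &s) {
    try {
        std::size_t pos = 0;
        (void)std::stold(s, &pos);
        return pos == s.size();
    } catch (const std::exception &) {
        return false;
    }
}

// Assumed inference rule: Int if all cells are integers, Float if all are
// numeric, Str otherwise (and for a table without rows).
Kind infer_kind(const StrTable &t, std::size_t col) {
    if (t.second.empty()) return Kind::Str;
    bool all_int = true, all_float = true;
    for (const auto &row : t.second) {
        all_int = all_int && parses_as_int(row[col]);
        all_float = all_float && parses_as_float(row[col]);
    }
    if (all_int) return Kind::Int;
    return all_float ? Kind::Float : Kind::Str;
}

// Pass 1 picks a kind per column (skipped columns stay strings);
// pass 2 converts every cell according to its column's kind.
MixedTable to_native(const StrTable &t, const std::vector<std::string> &skip = {}) {
    MixedTable out;
    out.first = t.first;
    std::vector<Kind> kinds;
    for (std::size_t col = 0; col < t.first.size(); ++col) {
        bool skipped = std::find(skip.begin(), skip.end(), t.first[col]) != skip.end();
        kinds.push_back(skipped ? Kind::Str : infer_kind(t, col));
    }
    for (const auto &row : t.second) {
        assert(row.size() == t.first.size());
        std::vector<Cell> converted;
        for (std::size_t col = 0; col < row.size(); ++col) {
            if (kinds[col] == Kind::Int) {
                converted.emplace_back(std::in_place_type<int64_t>, std::stoll(row[col]));
            } else if (kinds[col] == Kind::Float) {
                converted.emplace_back(std::in_place_type<long double>, std::stold(row[col]));
            } else {
                converted.emplace_back(std::in_place_type<std::string>, row[col]);
            }
        }
        out.second.push_back(std::move(converted));
    }
    return out;
}
```

Constructing each cell with `std::in_place_type` names the variant alternative explicitly, so the result does not depend on how the compiler resolves a conversion from `int` to the variant.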

◆ printData()

void gpmp::core::DataTable::printData ( )
inline

Definition at line 116 of file datatable_wip.hpp.

116  {
117  // Print column headers
118  for (const auto &header : headers_) {
119  std::cout << header << "\t";
120  }
121  std::cout << std::endl;
122 
123  // Print data rows
124  for (const auto &row : data_) {
125  for (const auto &cell : row) {
126  // Check the type of cell and print accordingly
127  if (std::holds_alternative<int64_t>(cell)) {
128  std::cout << std::get<int64_t>(cell);
129  } else if (std::holds_alternative<long double>(cell)) {
130  std::cout << std::get<long double>(cell);
131  } else if (std::holds_alternative<std::string>(cell)) {
132  std::cout << std::get<std::string>(cell);
133  }
134 
135  std::cout << "\t";
136  }
137  std::cout << std::endl;
138  }
139  }

References data_ and headers_.
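
The chain of `std::holds_alternative` checks in the listing above can also be written with `std::visit`, which dispatches on whichever alternative a cell holds. The following is a small sketch under assumptions: `format_table` and `Cell` are hypothetical names, the output goes to a string instead of `std::cout` so it can be inspected, and the row-width assertion is an added check that `printData()` does not perform.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Hypothetical alias for a mixed-type cell, as used in the listing above.
using Cell = std::variant<int64_t, long double, std::string>;

// Renders headers and rows with a tab after every field.
std::string format_table(const std::vector<std::string> &headers,
                         const std::vector<std::vector<Cell>> &rows) {
    std::ostringstream out;
    for (const std::string &header : headers) {
        out << header << "\t";
    }
    out << "\n";
    for (const auto &row : rows) {
        assert(row.size() == headers.size()); // added check, not in printData()
        for (const Cell &cell : row) {
            // The generic lambda is instantiated once per alternative.
            std::visit([&out](const auto &value) { out << value; }, cell);
            out << "\t";
        }
        out << "\n";
    }
    return out.str();
}
```

With `std::visit`, adding a fourth alternative to the variant needs no new branch as long as the type can be streamed.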

◆ sort() [1/2]

void gpmp::core::DataTable::sort ( const std::vector< std::string > &  sort_columns,
bool  ascending = true 
)

Sorts the rows of the DataTable based on the specified columns.

Parameters
sort_columns: a vector of column names to sort by
ascending: if true, sort in ascending order; otherwise, sort in descending order (default: true)

Definition at line 217 of file datatable.cpp.

218  {
219  // Extract the column indices to be sorted by from the original data
220  std::vector<size_t> column_indices;
221  for (const std::string &column : sort_columns) {
222  auto iter = std::find(headers_.begin(), headers_.end(), column);
223  if (iter != headers_.end()) {
224  size_t index = std::distance(headers_.begin(), iter);
225  column_indices.push_back(index);
226  }
227  }
228 
229  // Sort the data based on the specified columns
230  std::stable_sort(data_.begin(),
231  data_.end(),
232  [&](const std::vector<std::string> &row1,
233  const std::vector<std::string> &row2) {
234  for (size_t index : column_indices) {
235  if (row1[index] != row2[index]) {
236  if (ascending) {
237  return row1[index] < row2[index];
238  } else {
239  return row1[index] > row2[index];
240  }
241  }
242  }
243  // Rows are equal, nothing to sort
244  return false;
245  });
246 }
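
The comparator in the listing above can be exercised on its own. The sketch below restates it as a free function; `sort_rows` and `Row` are hypothetical names, and the row-width assertion is an added safety check. Two properties follow directly from the listing: column names that are not found in the headers are silently ignored, and cells are compared as strings, so the order is lexicographic ("10" sorts before "9").

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Stable multi-column sort over string rows (hypothetical free-function form).
void sort_rows(const std::vector<std::string> &headers,
               std::vector<Row> &rows,
               const std::vector<std::string> &sort_columns,
               bool ascending = true) {
    // Map column names to indices; unknown names are skipped.
    std::vector<std::size_t> indices;
    for (const std::string &name : sort_columns) {
        auto it = std::find(headers.begin(), headers.end(), name);
        if (it != headers.end()) {
            indices.push_back(static_cast<std::size_t>(std::distance(headers.begin(), it)));
        }
    }
    for (const Row &row : rows) {
        assert(row.size() == headers.size()); // added check so indexing is safe
    }
    // The first differing sort column decides; equal rows keep their order.
    std::stable_sort(rows.begin(), rows.end(), [&](const Row &a, const Row &b) {
        for (std::size_t i : indices) {
            if (a[i] != b[i]) {
                return ascending ? a[i] < b[i] : a[i] > b[i];
            }
        }
        return false;
    });
}
```

If numeric ordering is needed, the cells would have to be converted before comparison, for example with `str_to_int()` or `str_to_double()`.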

◆ str_to_double() [1/2]

gpmp::core::DataTableDouble gpmp::core::DataTable::str_to_double ( DataTableStr  src)

Converts a DataTableStr to a DataTableDouble.

Parameters
src: a DataTableStr object to be converted
Returns
The converted DataTableDouble
Note
The function assumes the input DataTableStr contains only valid double type elements. As the listing below shows, elements that do not match the decimal pattern are skipped rather than reported.

Definition at line 370 of file datatable.cpp.

370  {
372 
373  for (const auto &v : src.first) {
374  if (std::regex_match(v, std::regex("[-+]?\\d*\\.?\\d+"))) {
375  dest.first.push_back(std::stold(v));
376  }
377  }
378 
379  for (const auto &vv : src.second) {
380  std::vector<long double> new_vec;
381  for (const auto &v : vv) {
382  if (std::regex_match(v, std::regex("[-+]?\\d*\\.?\\d+"))) {
383  new_vec.push_back(std::stold(v));
384  }
385  }
386  dest.second.push_back(new_vec);
387  }
388 
389  return dest;
390 }
std::pair< std::vector< long double >, std::vector< std::vector< long double > > > DataTableDouble
Definition: datatable.hpp:74
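
To make the filtering behaviour concrete, the sketch below applies the same pattern, `[-+]?\d*\.?\d+`, to a flat list of cells. `parse_doubles` is a hypothetical helper, not part of the library. It compiles the regex once, whereas the listing constructs a `std::regex` for every cell. The pattern accepts plain decimals such as "3.5", "-2" and ".25", but not exponent notation ("1e3") or a trailing dot ("7.").

```cpp
#include <regex>
#include <string>
#include <vector>

// Keeps only the cells that match the decimal pattern and converts them.
std::vector<long double> parse_doubles(const std::vector<std::string> &cells) {
    // Same pattern as in the listing, compiled a single time.
    static const std::regex decimal(R"([-+]?\d*\.?\d+)");
    std::vector<long double> out;
    for (const std::string &cell : cells) {
        if (std::regex_match(cell, decimal)) {
            out.push_back(std::stold(cell));
        }
    }
    return out;
}
```

Because non-matching cells are dropped, a converted row can be shorter than its source row, so values may no longer line up with their original columns.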

◆ str_to_int() [1/2]

gpmp::core::DataTableInt gpmp::core::DataTable::str_to_int ( DataTableStr  src)

Converts a DataTableStr to a DataTableInt.

Parameters
src: the DataTableStr to convert
Returns
The converted DataTableInt
Note
This function assumes that the input DataTableStr contains only elements that can be converted with std::stoi(); the results are stored as 64-bit integers (int64_t). As the listing below shows, elements that do not consist solely of digits are skipped.
TODO: allow for specific columns to be converted
TODO: make use of ThreadPool

Definition at line 347 of file datatable.cpp.

347  {
349 
350  for (const auto &v : src.first) {
351  // check if v contains only digits
352  if (std::regex_match(v, std::regex("\\d+"))) {
353  dest.first.push_back(std::stoi(v));
354  }
355  }
356  for (const auto &vv : src.second) {
357  std::vector<int64_t> new_vec;
358  for (const auto &v : vv) {
359  // check if v contains only digits
360  if (std::regex_match(v, std::regex("\\d+"))) {
361  new_vec.push_back(std::stoi(v));
362  }
363  }
364  dest.second.push_back(new_vec);
365  }
366  return dest;
367 }
std::pair< std::vector< int64_t >, std::vector< std::vector< int64_t > > > DataTableInt
Definition: datatable.hpp:69
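
Two details of the listing above are worth illustrating. The pattern `\d+` matches unsigned digit strings only, so a value such as "-5" is skipped, and `std::stoi` returns an `int` and throws `std::out_of_range` when the value does not fit. The sketch below is a hypothetical variant, not the library's code: `parse_ints` keeps the same filter but converts with `std::stoll` and skips values that still overflow.

```cpp
#include <cstdint>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Keeps only all-digit cells and converts them to 64-bit integers.
std::vector<int64_t> parse_ints(const std::vector<std::string> &cells) {
    static const std::regex digits(R"(\d+)");
    std::vector<int64_t> out;
    for (const std::string &cell : cells) {
        if (!std::regex_match(cell, digits)) {
            continue; // signed values, decimals and text are skipped
        }
        try {
            out.push_back(std::stoll(cell));
        } catch (const std::out_of_range &) {
            // Digit strings too large for long long are skipped as well.
        }
    }
    return out;
}
```

Extending the pattern to `[-+]?\d+` would admit signed values; whether that is desirable depends on the data being loaded.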

◆ tsv_read() [1/2]

DataTableStr gpmp::core::DataTable::tsv_read ( std::string  filename,
std::vector< std::string >  columns = {} 
)

Reads a TSV file and returns a DataTableStr. The file is parsed and all data is stored as strings.

Parameters
filename: the path to the TSV file
columns: optional vector of column names to read in; if empty, all columns are read in
Returns
a DataTableStr containing the column names and data
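
No implementation is listed for tsv_read() on this page, so the following is only a sketch of one way the documented behaviour could be realised. It assumes that DataTableStr has the shape `std::pair<std::vector<std::string>, std::vector<std::vector<std::string>>>`, by analogy with the DataTableInt and DataTableDouble typedefs shown on this page. `read_tsv`, `split_tabs` and `StrTable` are hypothetical names, the input is a stream rather than a file name, and quoting, escaping and CRLF line endings are not handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Assumed shape of DataTableStr: (column names, rows of string cells).
using StrTable = std::pair<std::vector<std::string>, std::vector<std::vector<std::string>>>;

// Splits a line on tab characters, keeping empty fields.
std::vector<std::string> split_tabs(const std::string &line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Reads a header line and data lines; an empty `columns` selects every column.
StrTable read_tsv(std::istream &in, const std::vector<std::string> &columns = {}) {
    StrTable table;
    std::string line;
    if (!std::getline(in, line)) {
        return table; // empty input
    }
    const std::vector<std::string> header = split_tabs(line);
    std::vector<std::size_t> keep; // indices of the selected columns
    for (std::size_t i = 0; i < header.size(); ++i) {
        if (columns.empty() ||
            std::find(columns.begin(), columns.end(), header[i]) != columns.end()) {
            keep.push_back(i);
            table.first.push_back(header[i]);
        }
    }
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue; // skip blank lines
        }
        const std::vector<std::string> fields = split_tabs(line);
        std::vector<std::string> row;
        for (std::size_t i : keep) {
            // Short rows are padded with empty strings.
            row.push_back(i < fields.size() ? fields[i] : std::string());
        }
        table.second.push_back(std::move(row));
    }
    return table;
}
```

Reading from a `std::istream` keeps the sketch testable with a `std::istringstream`; a file-based wrapper would open a `std::ifstream` and pass it through.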

Member Data Documentation

◆ data_ [1/2]

std::vector<std::vector<std::string> > gpmp::core::DataTable::data_
private

Definition at line 85 of file datatable.hpp.

Referenced by csv_read(), DataTable(), display(), and printData().

◆ data_ [2/2]

MixedType gpmp::core::DataTable::data_
private

Definition at line 102 of file datatable_wip.hpp.

◆ headers_

std::vector< std::string > gpmp::core::DataTable::headers_
private

Definition at line 79 of file datatable.hpp.

Referenced by csv_read(), DataTable(), display(), and printData().

◆ new_headers_

std::vector< std::string > gpmp::core::DataTable::new_headers_
private

Definition at line 83 of file datatable.hpp.

◆ original_data_

DataTableStr gpmp::core::DataTable::original_data_
private

Definition at line 88 of file datatable.hpp.

◆ rows_ [1/2]

std::vector<std::vector<std::string> > gpmp::core::DataTable::rows_
private

Definition at line 81 of file datatable.hpp.

◆ rows_ [2/2]

MixedType gpmp::core::DataTable::rows_
private

Definition at line 98 of file datatable_wip.hpp.


The documentation for this class was generated from the following files: