Mann steht mit Tablet-Computer vor mehreren Roboterarmen

Wireless Communication Data Sets for Machine Learning

Applying machine learning techniques to wireless networks is a hot research topic. One of the biggest problems in this area is accessing a suitable data set. Because creating such data sets is expensive, reusing them is important. To provide more researchers access to data for machine learning in wireless networking we surveyed data sets. Our overview supports researchers that want to apply machine learning to wireless networking in finding a suitable data set.

Big data has many applications in wireless networking [1]. One is artificial intelligence for 5G management [2], which will supposedly also play a key role in 6G networks [3].

Wireless communication data sets range from low level physical layer measurements to social network analyses. Due to the inherent stochastic nature of wireless communication the collection of data sets always played a crucial role. For example, channel sounding has a long tradition to validate theoretical channel models in real environments and to determine the model parameters.

Increased application of signal processing techniques and advances in machine learning broadened the attention towards the analysis of higher layers. In this paper we focus on data sets suitable, for example, for network planning and traffic prediction.

Wireless Communication Data Sets

The Tables 1 + 2 provide an overview of selected data sets which might be suitable for machine learning applications in wireless networks. It is not a complete list and we have no inclusion criteria. Nevertheless, the list provides a starting point for researchers looking for data sets. We group data sets into finished one-time experiments and ongoing data-collection efforts.

Wireless Communication Data Sets
A. Finished one-time experiments

Experiments with a fixed duration have been run to determine the characteristics of wireless networks. They have been run at laboratories, locations with characteristics expected to be representative of a wide variety of locations and in locations which were expected to be exceptional. Examples are general urban areas and factories. Most data sets are generated by universities (e.g., [4]) or government agencies (e.g., NIST [5]). Data sets based on one-time experiments usually focus onlower layer (PHY) data.

Source	Time	Access	Area	Technology	Size/Entries	Data
CRAWDAD	–	Open	Depends	Depends	Depends	Depends
IEEE ComSoc	–	Open	Depends	Depends	Depends	Depends
COSINE	2020	Open	Lab	Cellular	55 GiB^(a)	SNR, …
KU Leuven	2017?	Partly	University	LTE	?	SINR, PHY info
RICE	2016?	Open	Lab	Experimental	37 GiB^(a)	Channel response matrix
NIST	2016?	Open	Factory	Low level	176 GiB	Impulse responses

Table 1: Selected data sets for wireless networks – finished one-time experiments
(a) in compressed form;

Wireless Communication Data Sets
B. Ongoing data-collection efforts

An alternative are data-collection efforts with an open time window. Such efforts are usually not based on specifically installed hardware, but turn to the public to generate data. That is, they provide software tools that allow every interested person to use their hardware to add data points to the data set. Many developed countries have agencies which support this, but also independent efforts (both commercial and non-commercial) exist. A recent white paper provides definitions, use cases and challenges for crowd-sourced measurements [6]. Data sets based on ongoing data collection usually focus onhigher layer (or application) data.

Source	Time	Access	Area	Technology	Size/Entries	Data
OECD^(b)	–	Depends	Country	Cellular	Depends	Data rate, location, …
RTR	2012 – …	Open	Austria	Cellular	1.5 x 10⁶	Data rate, latency, location
Bundesnetzagentur	2015 – …	Open	Germany	Cellular	1 x 10⁶	Data rate, latency, location
Ookla	2006 – …	Closed	Global	Cellular	30 x 10⁹	Data rate, latency, location
OpenSignal	2010 – …	Partly	Global	Cellular	Depends^(c)	Data rate, latency, location
CellMapper	2010 – …	Open	Global	Cellular	?	Base station location, configuration
WiGLE	2001 – …	Open	Global	WiFi^(d)	9 x 10⁹	Location, SSID, channel

Table 2: Selected data sets for wireless networks – ongoing data-collection efforts
(b) list of speed tests; (c) data sets > 10⁹ seem to be the norm; (d) also some data for Bluetooth and Cellular

Applications: Applying Machine Learning to Wireless Networks

Possible applications and challenges of applying machine learning (ML) to wireless networks have been discussed [7], [8], [9].

Due to the expected heterogeneity and complexity of wireless networks the traditional model-based approach in development and operation will no longer be feasible in the future [7]. Data-driven approaches will complement traditional design techniques based on mathematical models [7]. This shifts the goal of acquired data frommodel parameterization to machine learning or even continuous optimization for Self-Organzing Networks (SON). Along with Software Defined Networking (SDN) this data-driven approach enables network operatorsto apply ML-based methods in a variety of network locations: On the physical layer machine learning can be used for power control or spectrum management, or on higher layers for backhaul, cache, and resource management [9].

The training data that is needed depends on the specific application; either training data is individually generated and parametrized by the configuration or collected in an online fashion [9]. Typically, the former approach is used for lower layer problems whereas the later is used for higher layers.

Besides applications where machine learning directly interacts with the network itself, machine learning can be applied for analytic predictive tasks. The inherent structure of the collected data can be exploited by ML models to predict network metrics (e.g., data rate, latency, or reliability). These predictions span over time and space and their quality depends on the density of the data and the variability in time (primarily caused by user behavior) and space (primarily caused by network topology and geography). Technically, this is a problem of reconstructing undersampled data where the undersampling rate is defined by the data set. In its generality, a Nyquist-Shannon-based perfect unique reconstruction is impossible. Thus, due to the ability of implicitly exploiting the underlying data structure ML-based approaches might perform better in such scenarios.

Conclusion

Machine learning is already used in telecommunication, especially in the core network, and its significance will most likely grow in the near future. However, ML-based network planning and traffic prediction lags behind due to the previous lack of data sets. Selecting a suitable data set depends on the use case, but our table and the accompanying categorization supports researchers in finding such data sets.

References

[1] L. Qian, J. Zhu, and S. Zhang, “Survey of Wireless Big Data,”Journal of Communications and Information Networks, 2017.

[2] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, “Intelligent 5G: When cellular networks meet artificial intelligence,” IEEE Wireless communications, 2017.

[3] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Communications Magazine, 2019.

[4] D. Block, N. H. Fliedner, D. Toews, and U. Meier, “Wireless channel measurement data sets for reproducible performance evaluation in industrial environments,” in IEEE Conference on Emerging Technologies & Factory Automation, 2015.

[5] R. Candell, C. A. Remley, J. T. Quimby, D. R. Novotny, A. Curtin, P. B. Papazian, G. H. Koepke, J. Diener, and M. T. Hany, “Industrial wireless systems: Radio propagation measurements,” tech. rep., NIST, 2017.

[6] T. Hoßfeld and S. Wunderer, “White Paper on Crowdsourced Network and QoE Measurements–Definitions, Use Cases andChallenges,” doi: 10.25972/OPUS-20232, 2020.

[7] A. Zappone, M. Di Renzo, and M. Debbah, “Wireless networks design in the era of deep learning: Model-based, AI-based, or both?,” IEEE Transactions on Communications, 2019.

[8] U. Challita, H. Ryden, and H. Tullberg, “When machine learning meets wireless cellular networks: Deployment, challenges, and applications,” IEEE Communications Magazine, 2020.

[9] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, “Application of machine learning in wireless networks: Key techniques and open issues,” IEEE Communications Surveys & Tutorials, 2019.

This research is partly funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) and the Austrian state Salzburg.
More information: 5G-MLab & 5G-AI-MLab

See also:

Stefan Farthofer, Matthias Herlich, Christian Mier, Sabrian Pochaba, Julia Lackner, and Peter Dorfinger (2022): An Open Mobile Communications Drive Test Data Set and Its Use for Machine Learning. In: IEEE Open Journal of the Communications Society, vol. 3, pp. 1688-1701, 2022.

The capability to provide guarantees for network metrics, such as latency, data rate, and reliability will be an important factor for widespread adoption of next generation mobile networks and hence, such metrics play a central role in standards for new wireless communication technologies. However, due to the inherently stochastic nature of mobile communications, any guarantees can only be of statistical nature and are highly dependent on the actual physical environment. To analyze the stochastic behavior, this paper presents a tool chain for measurement, collection, evaluation, and prediction of controlled mobile communications drive test data. We also publish the underlying data set of measurements covering two years’ worth of highway traffic on a 25 km long section comprising 267 198 data points. We statistically evaluate the data set and validate it with a corresponding data set from another source. Applying machine learning to the data set illustrates possible use cases: Feed-forward neural networks to predict the data rate in five application scenarios, LIME to explain the behavior of the model, and an autoencoder to describe the interaction of five signal strength parameters. The data set and the tool chain show how machine learning can be applied to wireless networks and provide fellow researchers with the means to make further experiments.

Matthias Herlich, Stefan Farthofer (2020): Wireless Communication Data Sets for Machine Learning. In: 2nd KuVS Fachgespräch „Machine Learning and Networking“ in Würzburg, Germany (virtual).

Verwandte Posts:

Matthias Herlich

Matthias Herlich is a researcher in the advanced networking center at Salzburg Research. His technical expertise includes radio access networks, information-centric networking, software-defined networking, peer-to-peer networks, wireless sensor networks, and communication networks for smart grids.

Stefan Farthofer

Stefan Farthofer is a researcher at the Advanced Networking Center at Salzburg Research. His core areas are novel wireless communication systems and real-time communications. His focus lies in particular on measurement, data analysis and machine learning.

Salzburg Research Forschungsbereich(e): Intelligent Connectivity
Salzburg Research Forschungsschwerpunkt(e): Industry & Infrastructure Publiziert am 17. Dez 2020